scExplorer Guide

Introduction to Single-Cell RNA Sequencing

Biological Foundation

Single-cell RNA sequencing (scRNA-seq) enables the quantification of the transcriptome of individual cells, revealing the heterogeneity that conventional bulk RNA-seq averages out. In any tissue, seemingly homogeneous populations harbor distinct transcriptional states: progenitor cells co-exist with terminally differentiated progeny, activated immune cells reside alongside quiescent counterparts, and rare populations (stem cells, circulating tumor cells) that constitute <1% of the sample become identifiable.

Capture technologies

Droplet-based platforms (10x Genomics Chromium, Drop-seq, inDrop) encapsulate individual cells in nanoliter droplets with barcoded beads, enabling high throughput (thousands to tens of thousands of cells) at the cost of shallow sequencing depth per cell (~1,000–3,000 genes/cell). Plate-based methods (Smart-seq2, MARS-seq) sort cells into microtiter plates via FACS, providing full-length transcript coverage and deeper sequencing (~4,000–8,000 genes/cell) but at lower throughput (hundreds of cells). The choice of platform directly impacts downstream analysis: deeper sequencing reduces dropout events and improves detection of lowly expressed transcription factors.

The count matrix

The fundamental data structure in scRNA-seq is a genes × cells matrix where each entry represents the number of unique molecular identifiers (UMIs)Short random sequence appended during library prep to tag individual mRNA molecules, enabling PCR duplicate removal. or read counts for a gene in a cell. This matrix is characteristically sparse: typically 80–95% of entries are zero. These zeros arise from a combination of biological absence (the gene is genuinely not expressed in that cell) and technical dropoutFailure to detect an expressed transcript due to stochastic mRNA capture. Results in excess zeros in the count matrix. (the transcript was present but not captured due to the stochastic nature of mRNA capture and reverse transcription). Distinguishing between these two sources of zeros remains one of the central challenges in scRNA-seq analysis.

Because zero values can reflect both molecular undersampling and genuine biological heterogeneity, downstream analysis has diverged into two broad strategies. One strategy applies denoising or imputation to recover signal that may be masked by sparse sampling, particularly when the objective is to stabilize low abundance transcripts, recover regulatory programs, or improve gene-gene structure. This is the rationale behind methods such as MAGIC, SAVER, and scImpute. A second strategy avoids explicit imputation and models the observed counts directly, especially in UMI based droplet datasets where many zeros are compatible with standard count sampling. In this framework, excessive smoothing may blur cell state boundaries, inflate correlations, and distort downstream inference.

Current evidence indicates that the value of imputation is task dependent rather than universal. Svensson showed that droplet scRNA-seq data are often not zero inflated beyond standard count models, whereas Qiu emphasized that dropout patterns can themselves contain biologically informative structure. A systematic benchmark by Hou et al. further showed that imputation may improve some tasks, such as agreement with bulk profiles, while providing limited or inconsistent benefit for clustering and trajectory inference. Accordingly, imputation is best treated as an optional analytical layer for visualization or exploratory denoising, whereas conclusions about marker genes, differential expression, trajectories, or regulatory networks should ideally be confirmed in the observed count space or in a parallel non-imputed analysis.

References

Practical consideration

The quality of the input suspension is the single most critical determinant of data quality. Cell viability >85%, absence of clumps (which generate doublets), and gentle dissociation protocols (to minimize transcriptional stress responses such as induction of immediate early genes like FOS and JUN) are essential before any computational analysis begins.

0 Upload

Data Structures and Input Formats

AnnData (h5ad)

The AnnData format (used by Scanpy) organizes single-cell data into a structured object: X stores the count matrix (cells × genes), obs contains cell-level metadata (sample origin, QC metrics, cluster assignments), var holds gene-level annotations (gene symbols, Ensembl IDs, mitochondrial flags), obsm stores embeddings (PCA coordinates, UMAP), and layers can hold alternative representations (raw counts, normalized values, spliced/unspliced for velocity). This hierarchical structure ensures all analytical outputs remain bound to the original data, maintaining full provenance.

10x Genomics h5 (CellRanger output)

CellRanger produces filtered feature-barcode matrices in HDF5 format. The filtering step uses an algorithm based on the barcode rank plot (the "knee plot") to distinguish genuine cell-containing droplets from empty droplets containing ambient RNA. The h5 file contains the sparse count matrix, gene IDs (both Ensembl and symbol), and barcode sequences.

Seurat objects (RDS)

R-based workflows store data as Seurat objects serialized in RDS format. scExplorer can ingest these and convert them internally to AnnData for processing.

Gene ID mapping

Gene identifiers matter for downstream functional annotation. Ensembl IDs (ENSG00000141510) are stable across genome builds, while gene symbols (TP53) are more human-readable but can be ambiguous (multiple Ensembl IDs mapping to the same symbol). For mitochondrial gene detection, scExplorer uses the gene symbol prefix: MT- for human (e.g., MT-CO1, MT-ND1) and mt- for mouse (e.g., mt-Co1, mt-Nd1). Incorrect species assignment will result in zero mitochondrial genes detected and an unreliable QC metric.

1 Preprocessing & Quality Control

Biological Foundation

Quality control in scRNA-seq is fundamentally a biological problem: distinguishing intact, viable cells from damaged cells, empty droplets, and doublets. Each type of artifact has a specific molecular signature rooted in cell biology:

Empty droplets and low-complexity barcodes

In droplet-based platforms, a fraction of droplets fail to capture a cell but still contain ambient RNA released from lysed cells during sample preparation. These barcodes show characteristically low gene counts and total UMIs. The min_genes threshold removes these, but the appropriate cutoff depends on cell type: red blood cells and platelets naturally express very few genes (~200–500), whereas hepatocytes and neurons routinely express >4,000. Setting min_genes too high risks eliminating genuine low-complexity cell types.

Mitochondrial content as a viability marker

When the plasma membrane loses integrity (apoptosis, mechanical damage during dissociation), cytoplasmic mRNA leaks out of the cell. However, mitochondrial transcripts, protected by the double mitochondrial membrane, are retained proportionally. The result is a cell with an artificially elevated fraction of mitochondrial reads. The mito_threshold parameter sets the maximum tolerable proportion. Typical thresholds: 5% for brain tissue (neurons have low mitochondrial content), 10–15% for PBMCs, up to 20% for metabolically active tissues (heart, kidney) and solid tumors where dissociation-induced damage is common.

Doublet detection

Doublets occur when two cells are captured in the same droplet (~2–8% rate, scaling quadratically with loading concentration). Computational methods like Scrublet simulate artificial doublets by averaging pairs of transcriptomes, then score each real barcode by its similarity to these synthetic doublets. Heterotypic doublets (two different cell types) are detectable because they occupy intermediate positions in transcriptomic space; homotypic doublets (same cell type) are essentially undetectable computationally (Figure 2).

Figure 2: Quality control filtering strategy in scRNA-seq. Three classes of artifacts are identified and removed: empty droplets (low gene count, filtered by min_genes), damaged cells (elevated mitochondrial RNA fraction, filtered by mito_threshold), and doublets (hybrid transcriptomes from two co-captured cells, detected by Scrublet). Only barcodes passing all three filters are retained as viable cells for downstream analysis.

Key Parameters

Parameter	Function	Default	Range	Biological Impact
`min_genes`	Minimum genes detected per cell	200	100–500	Low values retain empty droplets; high values remove genuine low-complexity cells (erythrocytes ~200, platelets ~300). Check the gene count distribution for bimodal separation between empties and cells.
`min_cells`	Minimum cells expressing a gene	3	1–10	Removes genes detected in very few cells: likely technical noise or ambient RNA contamination. Setting to 1 retains potential rare markers; setting >10 risks losing genes specific to very small populations.
`mito_threshold`	Maximum % mitochondrial reads	20%	5–30%	Tissue-dependent: brain (5%), PBMCs (10–15%), solid tumors (20%), cardiomyocytes (up to 30% due to high mitochondrial content). Plot the distribution before choosing.

Artifact alert

Aggressive QC filtering can systematically remove biologically relevant populations. Activated T cells upregulate gene expression and may appear as outliers in gene count distributions. Apoptotic cells undergoing programmed cell death in vivo (not dissociation artifacts) may show moderately elevated mitochondrial content. Always examine QC metric distributions before setting hard thresholds.

References

2 Embedding & Clustering

Highly Variable Gene Selection

Not all genes carry equal information about cellular identity. Housekeeping genes (GAPDH, ACTB) are expressed at similar levels across cell types and contribute noise rather than signal to clustering. Highly variable gene (HVG) selectionGenes whose variance exceeds technical expectation, indicating biological relevance for distinguishing cell types. identifies genes with biological variance exceeding the technical (Poisson) expectation. These genes define cell type identity and state transitions. Typical selections retain 2,000–5,000 HVGs, capturing major identity markers while excluding uninformative genes that would dilute cluster resolution.

Dimensionality Reduction: PCA

Principal Component Analysis performs a linear transformation of the expression matrix to identify orthogonal axes (principal components) that capture decreasing amounts of variance. In scRNA-seq, the first 10–50 PCs typically capture biologically meaningful variance, including cell type identity, activation state, and cell cycle phase, while later PCs are dominated by technical noise.

The elbow plot (variance explained per PC) helps identify this transition: the point where the curve flattens marks the boundary between signal and noise. Using too few PCs risks losing subtle biological variation (e.g., functional states within a cell type); using too many introduces noise that can fragment clusters artificially. The full dimensionality reduction pipeline is illustrated in Figure 3.

Practical note

The number of informative PCs scales with dataset complexity. A PBMC dataset with 5–8 major cell types may be well-captured by 15–20 PCs. A developing organ with continuous differentiation trajectories and dozens of transitional states may require 30–50 PCs to preserve fine structure.

Figure 3: Dimensionality reduction pipeline in scExplorer. Starting from the full gene expression space (~20,000 genes), highly variable gene (HVG) selection retains 2,000–5,000 informative genes. PCA performs linear reduction to ~15–50 principal components concentrating biological signal. A k-nearest neighbor (k-NN) graph encodes cell-cell similarity and serves as the substrate for Leiden clustering. Finally, UMAP projects the graph into two dimensions for visualization.

k-NN Graph & UMAP

k-Nearest Neighbor graph

The k-NN graphk-Nearest Neighbor graph. Network connecting each cell to its k most similar neighbors in PC space. connects each cell to its k nearest neighbors in PC space, creating a network where edges represent transcriptomic similarity. This graph is the substrate for both clustering and UMAP projection. The n_neighbors parameter controls the locality of connections. Low values (5–15) preserve fine local structure, useful for identifying rare subpopulations or transitional states, while high values (30–100) capture broader relationships between cell populations at the cost of blurring subtle distinctions.

UMAP projection

UMAP (Uniform Manifold Approximation and Projection)Non-linear dimensionality reduction for visualization. Preserves local topology; global distances are not meaningful. is a non-linear dimensionality reduction that projects the neighbor graph into two dimensions for visualization. It preserves local topology (cells that are neighbors in high-dimensional space remain close in UMAP) but does not preserve global distances. Two clusters separated by a large gap in UMAP are not necessarily more transcriptomically different than two adjacent clusters. The relative position, distance, and size of clusters in UMAP can vary across random seeds. UMAP is a visualization tool, not an analytical one: all quantitative analyses (clustering, DEA, trajectory inference) operate on the graph or PC space directly.

Common misinterpretation

UMAP distances between clusters are NOT proportional to biological distance. A compact cluster in UMAP is not necessarily more homogeneous than a diffuse one. Avoid over-interpreting the visual layout: validate observations using quantitative methods (marker expression, DEA, connectivity analysis).

Leiden Clustering

The Leiden algorithmCommunity detection algorithm that partitions the k-NN graph into clusters by optimizing modularity with guaranteed connectivity. detects communities in the k-NN graph by optimizing a modularity-based objective function. It partitions the graph into groups of cells with dense internal connections (transcriptomically similar) and sparse connections between groups. Unlike the earlier Louvain algorithm, Leiden guarantees that each resulting community is connected: avoiding the generation of disconnected cluster fragments.

The resolution parameter is the primary control for clustering granularity. At low resolution (0.2–0.5), the algorithm identifies broad lineages: all T cells in one cluster, all myeloid cells in another. At intermediate resolution (0.5–1.0), major subtypes emerge: CD4+ and CD8+ T cells separate, classical and non-classical monocytes distinguish. At high resolution (1.0–2.0+), functional states within subtypes become visible: naive vs memory CD4+, Th1 vs Th2 vs Th17 vs Treg. However, excessive resolution fragments continuous biological variation into discrete clusters that can generate misleading differential expression results.

Parameter	Function	Default	Range	Biological Impact
`n_neighbors`	Neighbors per cell in k-NN graph	15	5–100	Low: fine local structure (rare subsets); High: global relationships (lineages). For datasets with expected rare populations (<1%), use lower values.
`n_pcs`	PCs used for graph/UMAP	30	10–50	Guided by elbow plot. Too few loses fine variation; too many introduces noise. Complex tissues (developing organs) benefit from more PCs.
`resolution`	Leiden clustering granularity	1.0	0.1–3.0	0.3: major lineages (T, B, myeloid). 1.0: subtypes (CD4, CD8, NK). 2.0+: states (Th1, Th17, Treg). Iterate and validate with known markers.
`min_dist`	UMAP minimum distance	0.5	0.0–1.0	Visual only. Low values create tighter clusters; high values produce more uniform spread. Does not affect analytical results.

References

3 Differential Expression Analysis

Biological Foundation

Differential expression analysis identifies genes whose expression differs significantly between clusters, enabling cell type annotation. Each cluster is compared against all other cells (one-vs-rest) to find marker genes: transcripts that are both highly expressed within the cluster and absent or low in others. Ideal markers combine high fold change (magnitude of upregulation) with high specificity (expression restricted to the cluster of interest). A gene upregulated 10-fold but expressed in multiple clusters is a poor marker; a gene upregulated 3-fold but exclusive to one cluster is far more informative for identity assignment.

Statistical Methods

Wilcoxon rank-sum test

A non-parametric test that compares the rank distributions of expression values between two groups. It makes no assumptions about the underlying distribution, making it robust for scRNA-seq data which is zero-inflated and right-skewed. Recommended as the default choice for most analyses.

Welch's t-test

A parametric test assuming approximately normal distributions. It is more statistically powerful (lower false negative rate) when the normality assumption holds: which improves after log-normalization and for moderately to highly expressed genes. However, for sparse genes with many zeros, the normality assumption breaks down and false positive rates increase.

Logistic regression

Models the probability that a cell belongs to a given cluster as a function of each gene's expression. The regression coefficient reflects each gene's discriminative capacity. This approach naturally accounts for the binary nature of classification and can be more robust to outlier expression values. Particularly useful for large datasets where computational efficiency matters.

Interpreting results: the volcano plot

Volcano plots display statistical significance (−log₁₀ p-value, y-axis) against effect size (log₂ fold change, x-axis), as shown in Figure 4. Genes in the upper right quadrant (high significance, strong upregulation) are the primary marker candidates. Genes with high fold change but low significance may be expressed in only a few cells (noisy). Genes with high significance but low fold change are broadly but weakly differentially expressed: potentially less useful for identity assignment but potentially interesting for regulatory analysis.

Figure 4: Anatomy of a volcano plot for differential expression analysis. Each dot represents a gene, plotted by its log₂ fold change (x-axis, effect size) against −log₁₀ adjusted p-value (y-axis, statistical significance). Dashed lines indicate typical thresholds (|log₂FC| > 1, p < 0.05). Red dots (upper right) are significantly upregulated marker candidates with both strong effect size and high confidence. Blue dots (upper left) are significantly downregulated genes. Grey dots (center/bottom) fail to meet significance or effect-size thresholds.

Limitations & Considerations

Over-clustering artifacts: If resolution is set too high, a continuous population (e.g., a gradient of monocyte activation) may be split into multiple clusters. DEA will then identify genes that differ along this gradient as "markers," but they represent arbitrary cut-points in a continuum rather than biologically discrete populations. If the top DEGs between two clusters are graded rather than binary, consider merging them.

Compositional effects: In one-vs-rest comparisons, a gene can appear as a "marker" for a rare cluster simply because it is absent in the dominant cluster type. Always verify that markers are specifically enriched in the target cluster, not just absent elsewhere.

Multiple testing: With thousands of genes tested across multiple clusters, false discovery correction (Benjamini-Hochberg) is essential. Use adjusted p-values for interpretation.

References

4 Gene Expression Visualization

Overlaying Expression on Embeddings

This module projects the expression of individual genes onto UMAP or PCA coordinates, enabling visual validation of cluster identity with known markers. By coloring each cell according to its expression level of a gene, you can assess whether marker genes co-localize with expected clusters.

Canonical marker validation

Cell type annotation typically starts with established markers: CD3E (T lymphocytes), MS4A1/CD20 (B lymphocytes), CD14/LYZ (classical monocytes), FCGR3A/CD16 (non-classical monocytes), NKG7/GNLY (NK cells), PECAM1/CD31 (endothelial cells), COL1A1 (fibroblasts), EPCAM (epithelial cells). Expression should be restricted to the expected cluster; diffuse expression across multiple clusters suggests either poor resolution or the gene is not a good discriminator for that dataset.

Visualization is qualitative

UMAP-based expression plots are useful for quick assessment but should not replace quantitative analysis. Color intensity in UMAP is affected by cell density (dense regions appear brighter), and UMAP distortions can visually merge or separate expression domains. For quantitative comparison of expression levels across clusters, use violin plots, dot plots, or the Heatmap module.

5 Heatmap

Matrix Visualization of Gene Expression

Heatmaps display the expression of selected genes (rows) across clusters or individual cells (columns), providing a comprehensive view of marker specificity and co-expression patterns. They are the standard method for presenting the transcriptional signature of each cluster in publications.

Z-score normalization (row scaling)

When enabled, each gene's expression is transformed to a z-score across clusters: values represent standard deviations from the gene's own mean. This is essential when comparing genes with vastly different absolute expression levels: a transcription factor expressed at 50 counts and a ribosomal gene at 10,000 counts become comparable when scaled. The color then reflects relative enrichment or depletion within each gene's own distribution.

Interpretation patterns

Block-diagonal pattern: When marker genes are correctly ordered, each cluster should show a distinct block of upregulated genes: the diagonal blocks indicate well-separated populations (Figure 9). Shared expression blocks: Two clusters sharing the same upregulated gene modules may represent subclusters of the same cell type that should be merged. Gradient expression: Genes showing progressive change across multiple clusters suggest a differentiation trajectory rather than discrete populations.

Figure 9: Schematic heatmap showing the ideal block-diagonal pattern for well-separated cell populations. Rows represent marker genes (italicized) and columns represent Leiden clusters (C0–C4). Each cluster displays high expression (red) for its characteristic markers and low expression (blue) elsewhere. Dashed outlines highlight the diagonal blocks corresponding to cluster-specific gene signatures: CD3E/CD4 for T cells (C0), MS4A1/CD79A for B cells (C1), CD14/LYZ for monocytes (C2), NKG7/GNLY for NK cells (C3), and PPBP/PF4 for platelets (C4). Off-diagonal signals (faint red) indicate shared expression across related populations.

6 Downstream Analyses

Downstream analyses extend beyond cell type identification to address dynamic and regulatory questions: How do cells transition between states? What directionality does this trajectory have? Which transcription factors drive cell identity? How do cell populations communicate? scExplorer integrates four specialized modules to address these questions.

Trajectory Inference

Biological Foundation

Many biological processes involve continuous transitions between cellular states: hematopoietic differentiation from stem cells through progenitors to mature blood cells, epithelial-to-mesenchymal transition in development and cancer, T cell activation and exhaustion gradients. Trajectory inference reconstructs these continuous processes from snapshot scRNA-seq data by ordering cells along inferred differentiation paths.

PAGA (Partition-based Graph Abstraction)

PAGA quantifies the statistical connectivity between Leiden clusters in the k-NN graph, generating a coarsened abstraction of the data topology. If two clusters share many inter-cluster edges (more than expected by random), PAGA assigns a high connectivity weight: indicating a probable biological transition. The resulting graph reveals which populations are directly connected (e.g., HSC → CMP → GMP → monocyte) and which are disconnected (e.g., T cells and epithelial cells in a tumor sample should not show strong connectivity).

DPT (Diffusion Pseudotime)

DPT measures the distance between cells using a diffusion process on the neighbor graph. Starting from a user-defined root cell (which should correspond to the biologically most undifferentiated state), DPT assigns each cell a pseudotimeContinuous variable ordering cells along an inferred trajectory. Not real time: represents transcriptional progression. value representing its progress along the differentiation trajectory. Unlike simple path-based methods, DPT can handle branching: identifying bifurcation points where progenitors commit to different lineage fates.

Key limitation

Trajectory inference assumes that the sampled cells represent a continuum of states. It is inappropriate for datasets composed of discrete, non-transitioning populations (e.g., a co-culture of fibroblasts and neurons). In such cases, any inferred trajectory is a computational artifact connecting populations that have no biological transition path.

Choosing the root

The root cell/cluster should be the most biologically undifferentiated population. In hematopoiesis, this is the HSC cluster. In tumor immune infiltrate, there may be no single root: in this case, trajectory analysis may not be appropriate, or multiple trajectories should be inferred independently for each lineage.

References

RNA Velocity (scVelo)

Biological Foundation

RNA velocity leverages the kinetics of mRNA processing to infer the future transcriptional state of each cell. Newly transcribed pre-mRNA retains intronic sequences (unspliced), which are progressively removed to generate mature mRNA (spliced). In steady state, the ratio of unspliced to spliced transcripts is constant. Deviations from this equilibrium are informative:

Excess of unspliced RNA → the gene is being actively upregulated (transcription rate exceeds degradation of unspliced intermediates).
Excess of spliced RNA → the gene is being downregulated (transcription has slowed; remaining spliced mRNA has not yet been degraded).

By computing this ratio across all genes for each cell, scVelo generates a velocity vector that indicates the probable direction of transcriptional change (Figure 5). When projected onto the UMAP embedding, these vectors form a flow field showing the predicted directionality of cell state transitions.

Figure 5: RNA velocity inference from splicing kinetics. Left: during gene activation, unspliced pre-mRNA (purple) accumulates faster than mature spliced mRNA (blue), yielding a positive velocity vector that predicts continued upregulation. Right: during gene repression, transcription has slowed but residual spliced mRNA persists, producing a negative velocity vector indicating future downregulation. These per-gene velocities are aggregated across all genes to predict each cell's future transcriptional state.

Methods & Parameters

Data requirements

RNA velocity requires quantification of both spliced and unspliced reads. Standard CellRanger count does not separate these; the data must be processed with velocyto (Loom files) or STARsolo with the --soloFeatures Gene Velocyto option. The unspliced/spliced layers must be present in the AnnData object.

Modes

Deterministic mode: Solves the ODE model for splicing kinetics analytically. Faster but assumes a single set of rate parameters per gene. Stochastic mode: Models the variance of the spliced/unspliced distributions, capturing intrinsic stochasticity of transcription. More robust for genes with complex kinetics but computationally more expensive.

Limitations

RNA velocity assumes first-order kinetics with a single transcriptional state per gene. It does not account for post-transcriptional regulation (miRNA silencing, mRNA stability modulation), alternative splicing, or genes with multiple transcriptional states. Velocity vectors for genes violating these assumptions will be unreliable. Additionally, the unspliced fraction is often noisier than spliced counts, particularly at shallow sequencing depths.

References

SCENIC: Gene Regulatory Networks

Biological Foundation

Cell identity is ultimately determined by the activity of transcription factors (TFs): proteins that bind specific DNA motifs in promoter/enhancer regions and regulate the expression of downstream target genes. SCENIC reconstructs these regulatory relationships from scRNA-seq data, identifying which TFs are active in each cell and which gene programs (regulons) they control (Figure 6).

Figure 6: The three-step SCENIC pipeline for gene regulatory network inference. GRNBoost2 identifies candidate TF-target co-expression modules using gradient boosting. RcisTarget validates these links by requiring the TF's binding motif in the target gene promoter, eliminating ~80–90% of spurious associations to produce curated regulons. AUCell scores the activity of each regulon in each individual cell using an area-under-the-curve metric, producing a cells × regulons activity matrix.

Computational Method

Step 1: GRNBoost2 (or GENIE3)

Uses gradient boosting regression to predict each gene's expression from TF expression levels, identifying TF-target co-expression modules. This step generates a large number of candidate regulatory links, many of which are indirect (gene A correlates with gene B because both are regulated by an unmodeled factor).

Step 2: RcisTarget

Filters candidate TF-target links by requiring the presence of the TF's binding motif in the promoter region (typically ±500bp around TSS) of the target gene. This critical validation step eliminates ~80–90% of spurious co-expression links, retaining only those supported by regulatory sequence evidence. The result is a set of regulonsA transcription factor and its validated target genes. Represents an active regulatory program inferred by SCENIC.: each consisting of a TF and its validated target genes.

Step 3: AUCell

Quantifies the activity of each regulon in each individual cell using the Area Under the recovery Curve (AUC) metric. For each cell, genes are ranked by expression, and the AUC measures how many regulon targets are among the top-expressed genes. This produces a cells × regulons activity matrix that can be used for clustering, differential analysis, and regulatory state characterization.

Example regulons by cell type

FOXP3 regulon: active in regulatory T cells (Tregs)
SPI1 (PU.1) regulon: active in myeloid lineage (monocytes, macrophages, dendritic cells)
PAX5 regulon: active in B lymphocytes
GATA1 regulon: active in erythroid progenitors
HNF4A regulon: active in hepatocytes

Processing note

SCENIC is the most computationally intensive module in scExplorer. It uses asynchronous processing with status polling; the analysis may take 30–60 minutes depending on dataset size and available resources. Progress can be monitored through the status indicator.

References

CellChat: Cell-Cell Communication

Biological Foundation

Cells within tissues do not operate in isolation. They communicate through a repertoire of signaling molecules: secreted ligands that bind cognate receptors on neighboring or distant cells, direct cell-cell contact via adhesion molecules, and extracellular matrix (ECM)-mediated signals. CellChat infers these intercellular communication networks from scRNA-seq data by evaluating the co-expression of ligand-receptor pairs across cell type pairs (Figure 7).

Figure 7: CellChat ligand-receptor communication inference. Sender cells expressing a ligand gene communicate with receiver cells expressing the cognate receptor. CellChat computes signaling probability for each ligand-receptor pair across all cell type pairs, incorporating co-factors and mediators from CellChatDB (>2,000 curated interactions for human and mouse). Statistical significance is assessed via permutation testing of cell type labels.

Computational Method

Signaling probability model

For each ligand-receptor pair, CellChat computes the probability of communication between each pair of cell types based on the average expression of the ligand in the sender population and the receptor in the receiver population, incorporating co-factors and mediators where known. A permutation test establishes statistical significance by shuffling cell type labels.

CellChatDB

The analysis relies on a curated database of >2,000 known interactions for human and mouse, categorized into:

Secreted signaling: Ligands released into the extracellular space (cytokines, chemokines, growth factors). Examples: CXCL12→CXCR4, TNF→TNFRSF1A.
Cell-cell contact: Direct interaction requiring physical proximity. Examples: NOTCH ligands (DLL1, JAG1)→NOTCH receptors, CD80/CD86→CD28/CTLA4.
ECM-receptor: Extracellular matrix components signaling through surface receptors. Examples: COL1A1→integrins, LAMA→dystroglycan.

Output interpretation

Communication networks: Directed graphs showing signaling strength between cell type pairs: edge weight reflects probability and number of significant interactions. Pathway-level analysis: Aggregation of individual interactions into pathways (WNT, NOTCH, TGFb) identifies dominant signaling programs. Sender/receiver roles: Identifies which cell types are predominantly sources vs targets of communication: useful for understanding immune microenvironment dynamics in tumors.

Important caveat

CellChat infers potential communication from expression data alone. It does not confirm that signaling actually occurs: protein-level expression, spatial proximity, and receptor activation state are not captured by scRNA-seq. Results should be interpreted as hypotheses to be validated experimentally (e.g., by spatial transcriptomics, receptor phosphorylation assays, or co-culture experiments).

References

Jin S, Guerrero-Juarez CF, et al. (2021). Inference and analysis of cell-cell communication using CellChat. Nature Communications.

scTag: Collaborative Cell Type Annotation

Purpose & Workflow

Cell type annotation: assigning biological identities to computationally defined clusters: is arguably the most subjective step in scRNA-seq analysis. It requires domain expertise, familiarity with tissue-specific marker panels, and often produces disagreements between researchers. scTag addresses this by enabling collaborative, multi-annotator workflows with transparent consensus building (Figure 8).

Annotation workflow

Export dataset from the scExplorer pipeline: the UMAP coordinates, cluster assignments, and top DEGs are loaded into scTag.
Interactive visualization: annotators explore the UMAP with cluster overlays and can query expression of specific markers in real time.
Independent annotation: each annotator labels clusters with cell type identities. The system provides marker-based suggestions using a scoring function (sigmoid-weighted scoring of positive markers that should be expressed and negative markers that should be absent).
Consensus generation: when multiple annotators complete their annotations, the system generates consensus labels using weighted voting. Annotator confidence scores are factored into the weighting.
Export: final consensus labels are exported as metadata (h5ad/RDS/CSV) for use in downstream analysis, publication, and data sharing.

When to use scTag vs Labeling

scTag is designed for collaborative projects: multi-laboratory consortia, large team studies, or datasets requiring cross-validation of annotations by multiple experts. Labeling (the in-pipeline module) is for individual workflows where a single analyst assigns cell types directly. Both produce the same output format; the choice depends on whether consensus-based validation is needed.

Figure 8: scTag collaborative cell type annotation workflow. Data is exported from the scExplorer pipeline with UMAP coordinates and cluster assignments. Multiple annotators independently explore markers and assign cell type labels. The consensus module aggregates annotations using confidence-weighted voting to produce final labels, which are exported as metadata (h5ad/RDS/CSV) for downstream use and publication.

7 Results Hub

Consolidated Output

The Results page aggregates all artifacts generated throughout the pipeline into a navigable dashboard organized by analysis step:

Raw Data: Original dataset summary: cell count, gene count, sparsity metrics.
QC Metrics: Violin plots and scatter plots of gene counts, UMI counts, and mitochondrial fraction before and after filtering.
Preprocessing: Normalization parameters, HVG list.
Embedding: PCA variance plots, UMAP coordinates, cluster assignments.
DEA: Ranked gene tables per cluster, volcano plots, statistical summaries.
Downstream Analyses: Trajectory graphs, velocity fields, SCENIC regulon heatmaps, CellChat networks (lazy-loaded for performance).

What to export for publication

For a typical manuscript: UMAP with cluster identities, annotated UMAP with cell type labels, top-10 marker gene heatmap, volcano plots for key comparisons, and relevant downstream analysis figures. All plots can be downloaded directly from the Results page. The underlying h5ad file contains all computed metadata for reproducibility.

Coherence check

Before finalizing results, verify consistency across steps: Do UMAP clusters correspond to distinct DEG profiles? Do trajectory directions agree with known biology? Do SCENIC regulons match expected TFs for annotated cell types? Inconsistencies may indicate suboptimal parameter choices in earlier steps.

Integration: Batch Effect Correction

Biological Foundation

When combining scRNA-seq datasets from different experiments, conditions, or laboratories, batch effectsSystematic technical variation between experiments that can confound biological signals.: systematic technical variations: can dominate the biological signal. Differences in cell dissociation protocols, library preparation kits, sequencing depth, or even the day of the experiment can cause cells of the same type to cluster separately by batch rather than by identity. Integration methods correct these technical variations while preserving genuine biological differences (e.g., treatment-induced changes, disease vs healthy states), as illustrated in Figure 10.

Figure 10: Batch effect correction through integration. Left: before integration, cells from Batch 1 (blue circles) and Batch 2 (red triangles) form separate clusters despite representing the same cell types, driven by technical variation between experiments. Right: after integration, cells cluster by biological identity (Cell type A, Cell type B) regardless of batch origin, with both batches intermixed within each cluster. Integration methods (Harmony, Scanorama, ComBat, BBKNN) remove systematic technical variation while preserving genuine biological differences between cell populations.

Available Integration Methods

Method	Approach	Best for	Limitations
Harmony	Iterative soft-clustering in PC space; adjusts embeddings to remove batch variation while preserving within-batch structure	Fast integration of batches with similar cell type compositions; large datasets; quick iteration	May over-correct if batches have genuinely different compositions; operates only in PC space (not on raw counts)
Scanorama	Finds mutual nearest neighbors (MNNs) across batches and uses these anchor pairs to learn a batch correction	Batches with partially overlapping compositions; preserves batch-specific populations	Slower than Harmony; requires sufficient shared cell types between batches as anchors
ComBat	Empirical Bayes linear model that estimates and removes batch-specific shifts and scaling	Simple technical variation (sequencing depth differences, kit differences) with known batch labels	Assumes linear batch effects; can over-correct complex non-linear variations; may not handle cell composition differences well
BBKNN	Modifies the k-NN graph construction to balance neighbors across batches, ensuring each cell's neighborhood includes cells from all batches	Preserving rare populations; moderate batch effects; graph-based analyses	Does not produce corrected expression values (correction is at the graph level only); depends on balanced batch sizes

Over-correction risk

Integration methods can remove genuine biological variation if it is confounded with batch identity. If condition A was processed entirely in batch 1 and condition B in batch 2, integration will remove both technical and biological differences. Experimental design should distribute conditions across batches whenever possible. Always compare integrated and non-integrated results to verify that biological signals are preserved.

References

Labeling: Manual Cell Type Assignment

Purpose

The Labeling module allows you to assign cell type names to Leiden clusters directly within the pipeline. After reviewing DEA results and validating markers in the Visualization module, each cluster receives a biological identity label (e.g., "CD4+ T cells," "Classical monocytes," "Alveolar type II cells").

Merging over-clustered populations

When multiple clusters represent fragments of the same biological population (common at high Leiden resolution), Labeling allows you to assign the same label to multiple clusters. This effectively merges them for downstream interpretation without needing to re-run the clustering. For example, if clusters 3, 7, and 12 all express CD3E, CD4, and TCF7 (naive CD4+ T cell markers) and their DEGs show only quantitative (not qualitative) differences, they can all be labeled "Naive CD4+ T cells."

Labeling vs scTag

Use Labeling for individual analysis workflows where you are the sole annotator. Use scTag when multiple experts need to independently annotate and reach consensus: this is standard practice for collaborative publications and consortium-level datasets.

Glossary

AnnData: Annotated data matrix format used by Scanpy. Stores the count matrix (X), cell metadata (obs), gene metadata (var), embeddings (obsm), and additional layers.

Barcode: Short DNA sequence (typically 16 nt in 10x) that uniquely identifies a cell-containing droplet. Each cell's reads carry a shared barcode.

Batch effect: Systematic technical variation between experiments or samples that can confound biological signals. Sources include different library prep dates, sequencing runs, or operators.

Doublet: A barcode representing two cells captured in the same droplet. Produces a hybrid transcriptome that can appear as an intermediate state or novel population.

Dropout: Failure to detect an expressed transcript due to the stochastic nature of mRNA capture and reverse transcription. Results in excess zeros in the count matrix.

Elbow plot: Plot of variance explained per principal component. The "elbow" indicates the transition from biologically informative PCs to noise-dominated ones.

HVG: Highly Variable Gene. A gene whose variance across cells exceeds the expected technical variance: indicates biological relevance for cell type distinction.

k-NN graph: k-Nearest Neighbor graph. Network connecting each cell to its k most similar neighbors in PC space. Substrate for clustering and UMAP.

Leiden: Community detection algorithm that partitions the k-NN graph into clusters by optimizing modularity with guaranteed connectivity.

Log₂FC: Log₂ fold change. Measure of expression difference between groups. log₂FC of 1 = 2× increase; log₂FC of 2 = 4× increase.

Pseudotime: Continuous variable ordering cells along an inferred differentiation trajectory. Not real time: represents transcriptional progression.

Regulon: A transcription factor and its set of validated target genes, as inferred by SCENIC. Represents an active regulatory program.

Spliced / Unspliced: Mature mRNA (introns removed) vs pre-mRNA (introns retained). Their ratio informs RNA velocity: the predicted direction of transcriptional change.

UMAP: Uniform Manifold Approximation and Projection. Non-linear dimensionality reduction for visualization. Preserves local topology; global distances are not meaningful.

UMI: Unique Molecular Identifier. Short random sequence appended during library prep to tag individual mRNA molecules, enabling PCR duplicate removal and absolute quantification.

Zero-inflated: Statistical property of scRNA-seq data where the number of zeros exceeds what is expected from a standard count distribution, due to dropout and biological absence combined.

Frequently Asked Questions

How do I choose the right mitochondrial threshold for my tissue?

Plot the distribution of mitochondrial fractions before applying any filter. Most tissues show a bimodal distribution: a main peak (viable cells) and a tail extending to high values (damaged cells). Place the threshold at the inflection point between these populations. Metabolically active tissues (heart, kidney, liver) naturally have higher mitochondrial content: do not apply a 5% cutoff universally. Consult published scRNA-seq studies of your specific tissue for typical ranges.

My clusters don't match known cell types: what should I check?

First, verify that QC filtering didn't remove a critical population. Second, try different resolution values: your expected cell types may be merged at low resolution or fragmented at high resolution. Third, confirm that the marker panel is appropriate for your tissue and species. Fourth, check for batch effects if integrating multiple samples. Finally, consider that your dataset may genuinely contain novel or transitional states not yet characterized in literature.

When should I use trajectory analysis vs RNA velocity?

Trajectory analysis (PAGA+DPT) infers connectivity and ordering from the static transcriptomic landscape: it works with any scRNA-seq dataset and identifies which populations are connected and their relative position along differentiation paths. RNA velocity adds directionality by using splicing kinetics: it tells you which direction cells are moving along a trajectory. Use trajectory first to establish the landscape; add velocity to determine direction. Note that velocity requires spliced/unspliced quantification, which is not available from all processing pipelines.

How many cells do I need for reliable analysis?

It depends on the biological question. For identifying major cell types (5–10 populations) in a relatively homogeneous tissue, 3,000–5,000 cells are often sufficient. For detecting rare populations (<1% frequency), you need at least 10,000–50,000 cells to ensure statistical representation. For trajectory and velocity analyses, denser sampling of transitional states improves reliability. Deeper sequencing per cell (more reads) improves gene detection sensitivity but does not substitute for cell number when studying heterogeneity.

Can I compare expression levels directly between clusters in UMAP plots?

No. UMAP color intensity reflects the expression value mapped to a color scale, but visual perception is confounded by cell density (dense regions appear more saturated), UMAP distortions, and the specific color map used. For quantitative comparisons, use violin plots, box plots, or dot plots from the Visualization module. UMAP expression overlays are best used for spatial pattern assessment: where is a gene expressed: rather than how much.

What does it mean if SCENIC shows no active regulons for a cluster?

Several possibilities: (1) The cell type's master regulators are not in the TF database used by SCENIC. (2) The regulatory program depends on post-transcriptional regulation not captured by expression-based inference. (3) The cluster may be defined by repression rather than activation of specific programs. (4) Low cell numbers in the cluster reduce statistical power. Consider examining the top DEGs manually for known TFs that might be below the activity threshold.

Should I integrate my samples before or after quality control?

Always perform QC filtering independently on each sample before integration. Each sample may have different quality profiles (dissociation efficiency, viability) requiring sample-specific thresholds. Integrating before QC risks propagating low-quality cells from one sample into shared embedding space, distorting the correction. After per-sample QC, concatenate the filtered datasets and then run the integration method.