scRNA-seq experiments produce high-dimensional data, with large numbers of both genes and cells. A 10X library may contain ~10,000 cells with ~30,000 detected genes (numbers vary with sequencing depth and handling quality). However, not all genes provide relevant information for the analysis, so the next step is to select the most variable genes (MVG). In (1), indicate the number of MVG to compute; these genes are then used for Principal Component Analysis (PCA). Next, indicate the method for MVG calculation (2). Two methods are available for MVG selection: Seurat and Cell Ranger. For this tutorial, we select 2000 MVG and Seurat as the method (default parameters). Click on Run (3) to continue.
The optimal number of MVG depends on the complexity of the dataset. Usually, between 1,000 and 5,000 MVG works well for single-cell data, so we encourage users to test different numbers of MVG. MVG should be determined after quality control, to guarantee that lowly expressed genes and low-quality cells do not interfere with the analysis. It is also important to determine whether the data contain technical noise, to avoid selecting MVG that are specific to only one batch. If your data has a batch effect, you can correct and integrate it in the Integration section. We offer two flavors for MVG selection: Seurat and Cell Ranger. In both, MVG are identified based on their mean and dispersion (variance/mean). For more details about MVG method selection, please see [2, 3, 4].
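The mean and dispersion ranking described above can be sketched in a few lines of plain Python. This is a simplified illustration only: the function name `top_variable_genes` and the toy counts are hypothetical, and the Seurat and Cell Ranger flavors apply additional processing that is not reproduced here (see [2, 3, 4]).

```python
from statistics import mean, variance


def top_variable_genes(counts, n_top):
    """Rank genes by dispersion (variance / mean) and return the top n_top.

    counts: dict mapping gene name -> list of expression values, one per cell.
    Hypothetical helper for illustration; not the scExplorer implementation.
    """
    dispersion = {}
    for gene, values in counts.items():
        mu = mean(values)
        if mu == 0:
            # Genes with no expression carry no information; skip them.
            continue
        dispersion[gene] = variance(values) / mu
    ranked = sorted(dispersion, key=dispersion.get, reverse=True)
    return ranked[:n_top]


# Toy data: four genes measured in four cells (made-up values).
toy_counts = {
    "GeneA": [0, 0, 10, 10],  # strongly variable
    "GeneB": [5, 5, 5, 5],    # constant
    "GeneC": [4, 6, 4, 6],    # mildly variable
    "GeneD": [0, 0, 0, 0],    # not expressed
}

print(top_variable_genes(toy_counts, n_top=2))
```

All three expressed genes share the same mean, so the ranking here is driven entirely by the variance.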
After running the analysis, you will see three plots of PC1 versus PC2, colored by: % of Mitochondrial Genes (1), Total Counts (2), and number of Genes per Counts (3). Plotting the top two PCs helps reveal whether undesired features, such as batch and QC metrics, generate significant variation in your dataset. In (4) you can see the top 25 PCs ranked according to the Variance Ratio, with the suggested number of PCs for downstream analysis shown in red. For this tutorial, we will use the first 10 PCs, but we encourage users to explore using more or fewer PCs. Below these plots you will find a clustree (5), a commonly used tool (particularly in R/Seurat) for exploring and interpreting how clusters change across different levels of clustering resolution. This visualization is especially useful with clustering algorithms such as Louvain or Leiden, which expose a resolution parameter that controls the granularity of the resulting clusters. Here, the clustree is calculated using a Shared Nearest Neighbor (SNN) graph, and it can serve as a reference during the construction of the k-nearest neighbors (KNN) graph downstream.
In the context of single-cell genomics, a principal component (PC) is a mathematical construct derived from principal component analysis (PCA), a dimensionality reduction technique used to simplify complex datasets. Each PC represents a direction in the data that captures as much variance as possible. The first principal component (PC1) captures the most variance, the second principal component (PC2) captures the second most, and so on, with each subsequent component being orthogonal (uncorrelated) to the others. PCA is useful in single-cell genomics because of the high-dimensional nature of the data, where each cell is represented by thousands of gene expression values. By reducing the data to just a few principal components, you can visualize the relationships between cells in a lower-dimensional space without losing much of the information.
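To make the idea of a variance ratio concrete, the following minimal sketch computes how much variance PC1 and PC2 capture for a toy dataset with only two genes, where the eigenvalues of the 2x2 covariance matrix have a closed form. The function name and the values are hypothetical; real datasets have thousands of genes and require a general-purpose PCA routine.

```python
import math
from statistics import mean


def explained_variance_ratio_2d(x, y):
    """Return the fraction of total variance captured by PC1 and PC2
    for two variables (e.g. two genes measured across the same cells)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix via trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    gap = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    lam1 = trace / 2 + gap  # variance along PC1
    lam2 = trace / 2 - gap  # variance along PC2
    return lam1 / trace, lam2 / trace


# Two strongly correlated toy genes across five cells (made-up values).
gene_x = [1, 2, 3, 4, 5]
gene_y = [2.1, 3.9, 6.2, 7.8, 10.1]

pc1_ratio, pc2_ratio = explained_variance_ratio_2d(gene_x, gene_y)
print(f"PC1: {pc1_ratio:.4f}, PC2: {pc2_ratio:.4f}")
```

Because the two toy genes vary together, almost all of the variance falls along PC1, which is why a single component can summarize both of them with little loss of information.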
Once you are satisfied with the PC analysis, the next step is to construct the KNN graph from the desired number of PCs and embed it in two dimensions for visualization with Uniform Manifold Approximation and Projection (UMAP). In (1) you can set the desired number of PCs, and in (2) the number of neighbors used for community detection. In (3) you need to set the resolution for cluster identification with the Leiden graph-clustering method. For this tutorial, we will set the number of PCs to 15, the number of neighbors to 15, and the Leiden resolution to 1 (default parameters). We encourage users to explore multiple parameter combinations. Click on Run (4) to generate the UMAPs.
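As a rough illustration of what the number of neighbors controls, the sketch below builds a KNN graph over a handful of toy points and counts how many disconnected groups result for different values of K. The helper names `knn_graph` and `count_components` and the coordinates are hypothetical; in a real analysis the points would be cells described by their PC coordinates, and the neighbor search would be approximate for speed.

```python
import math


def knn_graph(points, k):
    """Link every point to its k nearest neighbors (Euclidean distance).

    Returns a dict: point index -> set of connected point indices.
    Edges are made symmetric, so the graph is undirected.
    """
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def count_components(graph):
    """Count connected components with a simple depth-first search."""
    seen = set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return components


# Two well-separated toy groups of three points each (made-up coordinates).
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]

for k in (2, 3):
    n_groups = count_components(knn_graph(toy_points, k))
    print(f"k={k}: {n_groups} connected group(s)")
```

With K=2 every point finds its neighbors inside its own group, so the two groups stay separate; with K=3 each point must reach into the other group, and the graph becomes a single connected piece.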
In single-cell genomics, both the Louvain and Leiden algorithms are widely used for clustering cells based on their gene expression profiles. These algorithms identify groups of cells (clusters) that share similar gene expression patterns, representing distinct cell types or states within a complex dataset. The Leiden algorithm is now preferred over Louvain for several reasons [5]. Leiden produces more meaningful and accurate clusters than Louvain, especially for complex datasets such as single-cell data, where many small but biologically significant clusters may exist. Leiden also handles different resolutions better than Louvain. Resolution is a parameter that controls the granularity of clusters, and Leiden tends to yield more stable results across resolution settings, making it more versatile in capturing both fine-grained and broad biological structures.
The number of neighbors (typically denoted K in KNN) is a critical parameter that influences how cells are grouped and visualized in methods like UMAP and KNN-based clustering. The choice of K determines the structure of the KNN graph, which is the foundation for dimensionality reduction (e.g., UMAP) and clustering algorithms (e.g., Louvain or Leiden). A low K creates tighter, more localized neighborhoods, in which each cell is connected to only a few of its nearest neighbors. This can produce fine-grained clusters that capture local variation, making it easier to detect small or rare cell populations. A higher K connects each cell to more neighbors, creating a more densely connected graph. This tends to merge smaller clusters into larger ones, which can capture broader trends but may miss fine-grained or rare cell populations.
UMAP is a dimensionality reduction technique widely used in single-cell genomics to visualize high-dimensional data in a lower-dimensional space. It is particularly adept at preserving both the global and local structure of the data, making it a powerful tool for uncovering patterns and relationships that might be hidden in the high-dimensional space.
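The effect of the resolution parameter can be illustrated by scoring two candidate clusterings of a small graph. The sketch below assumes a modularity-style quality function with a resolution factor, which is one common choice for Louvain and Leiden; the tutorial does not state which quality function scExplorer uses, so treat this as an assumption. It only scores fixed partitions and does not implement the Leiden algorithm itself. All names and the toy graph are hypothetical.

```python
def modularity(edges, membership, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph,
    with a resolution factor applied to the expected-edges term.

    edges: list of (u, v) pairs; membership: dict node -> cluster label.
    """
    two_m = 2 * len(edges)
    internal = {}    # cluster -> adjacency entries inside it (2 per edge)
    degree_sum = {}  # cluster -> summed degree of its member nodes
    for u, v in edges:
        for node in (u, v):
            c = membership[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal[c] = internal.get(c, 0) + 2
    return sum(
        internal.get(c, 0) / two_m - resolution * (degree_sum[c] / two_m) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

coarse = {node: 0 for node in range(6)}          # one big cluster
fine = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}      # one cluster per triangle


def preferred_partition(resolution):
    """Return which of the two candidate partitions scores higher."""
    q_coarse = modularity(toy_edges, coarse, resolution)
    q_fine = modularity(toy_edges, fine, resolution)
    return "fine" if q_fine > q_coarse else "coarse"


for res in (0.1, 1.0):
    print(f"resolution={res}: {preferred_partition(res)} partition scores higher")
```

At a low resolution the single merged cluster scores higher, while at resolution 1 the split into two triangles wins, which mirrors how raising the resolution yields more, smaller clusters.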
After running the analysis, scExplorer shows four UMAPs: (1) highlights the Leiden clusters, and (2-4) show % of Mitochondrial Genes, Total Counts, and number of Genes per Counts, respectively. Next, click on DEA (5) to continue to the Differential Expression Analysis (DEA).