scRNA-seq experiments produce high-dimensional data, both in the number of genes and in the number of cells. A 10X library could be composed of ~10,000 cells with ~30,000 detected genes (numbers can vary depending on sequencing depth and handling quality). However, not all genes provide relevant information for the analysis.
For this reason, the next step is to select the most variable genes (MVG). In (1), enter the number of MVG to compute; these genes are then used for Principal Component Analysis (PCA). Next, choose the method for MVG calculation (2). Two methods are available for MVG selection: Seurat and Cell Ranger. For this tutorial we select 2000 MVG and Seurat as the method (default parameters). Click on Run (3) to continue.
The optimal number of MVG depends on the complexity of the dataset. Usually, a range between 1,000 and 5,000 MVG works well across single-cell data, so we encourage users to test different numbers of MVG.
MVG should be determined after quality control, so that lowly expressed genes and low-quality cells do not interfere with the analysis. It is also important to check whether the data contain batch-related technical variation, so that the selected MVG do not reflect only one batch. If your data have a batch effect, you can correct and integrate them in the Integration section.
We offer two flavors for MVG selection: Seurat and Cell Ranger. Both identify MVG from each gene's mean expression and dispersion (variance/mean). For more details about MVG method selection, please see [2, 3, 4].
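The mean and dispersion idea can be illustrated with a minimal sketch. This is not the Seurat or Cell Ranger flavor itself: it simply ranks genes by raw variance/mean and skips the binning and normalization those methods apply. The function name `top_dispersion_genes` and the toy matrix are hypothetical.

```python
import numpy as np

def top_dispersion_genes(counts, n_top):
    """Rank genes by dispersion (variance / mean) and return the top n_top.

    counts: array of shape (n_cells, n_genes).
    Simplified illustration only; no binning or normalization is applied.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    # Genes with zero mean have undefined dispersion, so rank them last.
    dispersion = np.full(mean.shape, -np.inf)
    expressed = mean > 0
    dispersion[expressed] = var[expressed] / mean[expressed]
    order = np.argsort(dispersion)[::-1]  # highest dispersion first
    return order[:n_top], dispersion

# Toy matrix: 6 cells x 4 genes (hypothetical values).
toy = np.array([
    [5, 0, 1, 0],
    [5, 9, 1, 0],
    [5, 0, 2, 0],
    [5, 8, 1, 0],
    [5, 0, 2, 0],
    [5, 7, 1, 0],
])
selected, dispersion = top_dispersion_genes(toy, n_top=2)
print(selected)  # indices of the two most dispersed genes
```

In this toy matrix, the gene that is constant across cells and the gene that is never detected rank last, while the gene that switches between 0 and high counts ranks first.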
After running the analysis, you will see three plots for PC1 and PC2 highlighting:
Plotting the top 2 PCs is useful to check whether unwanted factors, such as batch or QC metrics, drive substantial variation in your dataset. In (4) you can see the top 25 PCs ranked according to the Variance Ratio, with the suggested number of PCs to use for downstream analysis shown in red. For this tutorial, we will use the first 10 PCs, but we encourage users to explore using more or fewer PCs.
In addition, below you will find a clustree (5), a commonly used tool (particularly in R/Seurat) to explore and interpret how clusters change across different levels of clustering resolution. This visualization is especially useful with clustering algorithms such as Louvain or Leiden, which expose a resolution parameter that controls the granularity of the resulting clusters. Here, the clustree is calculated using a Shared Nearest Neighbor (SNN) graph and can serve as a reference when constructing the k-nearest neighbors (KNN) graph downstream.
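To make the idea of tracking clusters across resolutions concrete, here is a minimal sketch of the bookkeeping behind such a tree: counting how many cells move from each cluster at a lower resolution to each cluster at a higher resolution. The function names and the label vectors are hypothetical, and this is not the clustree implementation.

```python
from collections import Counter

def cluster_transitions(labels_low, labels_high):
    """Count cells per (low-resolution cluster, high-resolution cluster) pair."""
    if len(labels_low) != len(labels_high):
        raise ValueError("Both label vectors must cover the same cells.")
    return Counter(zip(labels_low, labels_high))

def incoming_proportions(transitions):
    """Fraction of each high-resolution cluster that comes from each low-resolution cluster."""
    totals = Counter()
    for (_, high), n_cells in transitions.items():
        totals[high] += n_cells
    return {edge: n_cells / totals[edge[1]] for edge, n_cells in transitions.items()}

# Hypothetical cluster labels for the same 8 cells at two resolutions.
low = ["0", "0", "0", "0", "1", "1", "1", "1"]
high = ["0", "0", "2", "2", "1", "1", "1", "2"]

transitions = cluster_transitions(low, high)
proportions = incoming_proportions(transitions)
for edge in sorted(transitions):
    print(edge, transitions[edge], round(proportions[edge], 2))
```

A high-resolution cluster that draws cells from several low-resolution clusters (cluster "2" in this toy example) is a sign that the split may be unstable.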
In the context of single-cell genomics, a principal component (PC) is a mathematical construct derived from principal component analysis (PCA), a dimensionality reduction technique used to simplify complex datasets.
Each PC represents a direction in the data that captures as much variance as possible. The first principal component (PC1) captures the most variance, the second principal component (PC2) captures the second most, and so on, with each subsequent component being orthogonal (uncorrelated) to the others.
PCA is useful in single-cell genomics because of the high-dimensional nature of the data, where each cell is represented by thousands of gene expression values. By reducing the data to a small number of principal components, you can visualize the relationships between cells in a lower-dimensional space while retaining the dominant sources of variation.
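The properties described above (variance ratio decreasing from PC1 onward, and uncorrelated components) can be checked with a small sketch that computes PCA through a singular value decomposition of the centered matrix. The simulated matrix and the function name `pca_svd` are assumptions for illustration only.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the column-centered matrix.

    X: array of shape (n_cells, n_genes).
    Returns cell scores, gene loadings, and the variance ratio per PC.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_var = S**2 / (X.shape[0] - 1)
    ratio = explained_var / explained_var.sum()
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], ratio[:n_components]

# Simulated data: 200 cells x 50 genes, two groups differing in 5 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:100, :5] += 3.0

scores, components, ratio = pca_svd(X, n_components=10)
print(ratio)  # variance ratio, largest first
```

Because the two simulated groups differ in several genes at once, PC1 captures that difference and has a clearly larger variance ratio than the remaining PCs.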
Once you are satisfied with the PC analysis, the next step is to construct the KNN graph from the desired number of PCs and embed it in two dimensions for visualization with Uniform Manifold Approximation and Projection (UMAP).
Parameter settings:
For this tutorial, we will set the number of PCs to 15, the number of neighbors to 15, and the Leiden resolution to 1 (default parameters). We encourage users to explore multiple parameters. Click on Run (4) to generate the UMAPs.
In single-cell genomics, both the Louvain and Leiden algorithms are widely used for clustering cells based on their gene expression profiles. These algorithms help identify groups of cells (clusters) that share similar gene expression patterns, representing distinct cell types or states within a complex dataset.
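Louvain was designed to optimize modularity, and Leiden can optimize the same kind of quality function with a resolution parameter. The sketch below evaluates resolution-scaled modularity for fixed partitions of a toy graph to show how the resolution shifts the preferred granularity. It does not implement either algorithm, and whether scExplorer uses exactly this quality function is an assumption; the graph and partitions are hypothetical.

```python
import numpy as np

def modularity(adjacency, labels, resolution=1.0):
    """Resolution-scaled modularity of a partition of an undirected graph.

    Q = (1 / 2m) * sum_ij [A_ij - resolution * k_i * k_j / (2m)] * [c_i == c_j]
    """
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()
    same_cluster = labels[:, None] == labels[None, :]
    expected = resolution * np.outer(degrees, degrees) / two_m
    return ((A - expected) * same_cluster).sum() / two_m

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

coarse = [0, 0, 0, 1, 1, 1]  # one cluster per triangle
fine = [0, 0, 1, 2, 3, 3]    # triangles split further

for gamma in (1.0, 3.0):
    print(gamma, modularity(A, coarse, gamma), modularity(A, fine, gamma))
```

At resolution 1 the coarse partition scores higher, while at resolution 3 the finer partition does, which matches the behavior of the resolution setting: larger values favor more, smaller clusters.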
Why Leiden is preferred over Louvain:
The number of neighbors (K in KNN): This critical parameter influences how cells are grouped and visualized. Low K creates tighter, more localized neighborhoods (fine-grained clusters), while higher K values create more connected graphs (broader clusters).
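The effect of K on graph connectivity can be seen in a brute-force sketch that builds a KNN graph from low-dimensional coordinates and counts its connected components. The coordinates stand in for PC scores and are hypothetical; real tools typically use approximate neighbor search rather than a full distance matrix.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (Euclidean, brute force)."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def count_components(neighbors):
    """Number of connected components, treating KNN edges as undirected."""
    n = neighbors.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                parent[root_i] = root_j
    return len({find(i) for i in range(n)})

# Two well-separated groups of 5 "cells" each (hypothetical coordinates).
points = np.array([[x, 0.0] for x in range(5)] + [[100.0 + x, 0.0] for x in range(5)])

components_k2 = count_components(knn_indices(points, 2))
components_k5 = count_components(knn_indices(points, 5))
print(components_k2, components_k5)
```

With K smaller than the group size, every neighbor lies within the same group and the two groups stay disconnected; once K exceeds the number of available in-group neighbors, edges cross between groups and the graph becomes connected.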
UMAP is a dimensionality reduction technique widely used in single-cell genomics to visualize high-dimensional data in a lower-dimensional space. UMAP preserves local neighborhood structure well and retains part of the global structure of the data, making it a powerful tool for uncovering patterns and relationships that might be hidden in the high-dimensional space.
After running the analysis, scExplorer will show four UMAPs:
Next, click on DEA (5) to continue to the Differential Expression Analysis (DEA).