Accurately Clustering Single-cell RNA-seq data by Capturing Structural Relations between Cells through Graph Convolutional Network

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies provide a great opportunity to study gene expression at cellular resolution, and scRNA-seq data have been routinely used to unfold cell heterogeneity and diversity. A critical step in scRNA-seq analyses is to cluster cells of the same type, and many methods have been developed for cell clustering. However, existing clustering methods are limited to extracting representations from the expression data of individual cells, while ignoring the high-order structural relations between cells. Here, we propose a new method (GraphSCC) to cluster cells based on scRNA-seq data by accounting for structural relations between cells through a graph convolutional network. The representation learned from the graph convolutional network, together with another representation output by a denoising autoencoder network, is optimized by a dual self-supervised module for better cell clustering. Extensive experiments indicate that the GraphSCC model outperforms state-of-the-art methods in various evaluation metrics on both simulated and real datasets.


I. INTRODUCTION
Recent advances in single-cell transcriptomics permit researchers to study complex differentiation and developmental trajectories [1], to discover novel cell types [2], and to improve understanding of human diseases. As scRNA-seq experiments usually include thousands to millions of cells, most of these studies require clustering cells of the same type. Benefiting from extensive studies of the classic clustering problem in machine learning, many algorithms developed for general clustering purposes have been transplanted to scRNA-seq clustering analyses [3].
Despite the advances of many clustering methods and significant improvements in scRNA-seq technologies, clustering cells based on scRNA-seq data remains challenging. To be more specific, scRNA-seq data often contain substantial noise and dropout events, which may be caused by biological and technical factors such as cell-cycle effects, amplification bias, and the low RNA capture rate [4]. A dropout event is a missed gene measurement that results in a 'false' zero count observation [5]. Therefore, addressing the noise and dropout events is essential for improving clustering analyses. To handle the dropout events in scRNA-seq data, several imputation methods have been developed. Early methods are based on statistical models, e.g., CIDR [6] and MAGIC [7]. With the development of deep learning techniques, many neural-network-based models have been developed. For example, DCA [5] reconstructs the scRNA-seq data through an autoencoder optimized by a loss function based on the zero-inflated negative binomial (ZINB) distribution. Although the imputed scRNA-seq data can improve the clustering results, the results remain unsatisfactory because these methods are not optimized for clustering, and the imputed data may generate false-positive gene-gene correlations [8].
Recently, several clustering methods have been specifically developed for scRNA-seq data. For example, SIMLR is a spectral clustering method that learns a robust distance metric to fit the structure of scRNA-seq data using multiple kernels [9]. Seurat3.0 utilizes the Louvain algorithm to cluster cells based on low-dimensional scRNA-seq data [10]. DendroSplit uncovers multiple levels of biologically meaningful cell populations using feature selection in scRNA-seq data [11]. ScDeepCluster is a single-cell model-based deep embedded clustering method, which takes the overdispersion and sparsity of the scRNA-seq data into account [12]. However, most of these methods rely only on the data of individual cells without explicitly considering the structural relations between cells.
Structural information can be naturally captured by Graph Convolutional Networks (GCN) [13]. In recent years, GCN has been successfully applied to a wide range of applications. Guo et al. designed an unsupervised deep embedding method for clustering analysis (IDEC) [14], which iteratively refines clusters by learning highly confident assignments using an auxiliary target distribution. Recently, Bo et al. designed SDCN to integrate structural information between objects [15]. Theoretically, they proved that the inclusion of GCN imposes a high-order regularization constraint that helps learn representations improving the clustering results.
Inspired by these works, we propose a new GCN-based model, GraphSCC, to integrate structural information into the clustering of scRNA-seq data. Meanwhile, a denoising autoencoder network is employed to obtain low-dimensional representations capturing local structure, and a dual self-supervised module is utilized to optimize the representations and the clustering objective function iteratively in an unsupervised manner. We demonstrate that GraphSCC outperforms state-of-the-art methods on both simulated and real datasets.

II. MATERIALS AND METHODS

A. Datasets and Preprocessing
Simulated Data. Our simulated data were generated by the Splatter R package [16]. For all simulated datasets, we set 2000 cells composed of 2000 genes with four groups of equal size, and used default values for the other parameters. To simulate various clustering signal strengths, we generated datasets with different sigma values of the log-normal distribution.
Real Data. We analyzed multiple human and mouse scRNA-seq datasets involving various tissues and different biological processes, as used in a previous study [17]. The published datasets used in this paper are available under the following accession numbers: GSE84133, GSE67835, GSE45719, E-MTAB-3321, GSE65525, GSE81861, GSE74672, E-MTAB-5061, GSE60361, GSE57249, GSE71585, GSE52583, GSE81608, GSE36552.
Preprocessing. For the simulated datasets, we normalized the counts using the transcripts-per-million (TPM) method [18] and then scaled the value of each gene to [0, 1]. For the real datasets, we employed the procedure suggested by Seurat3.0 to normalize the data and select the top 2000 highly variable genes, and then scaled the value of each gene to [0, 1].
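As a concrete illustration, the per-gene scaling step can be sketched in NumPy, together with a simple library-size normalization standing in as a proxy for TPM (true TPM additionally divides by gene length); the function names here are ours, not from the GraphSCC implementation:

```python
import numpy as np

def normalize_counts(counts):
    """Library-size normalize a cells-by-genes count matrix to
    counts-per-million, then log1p. (A TPM-like stand-in; real TPM
    also corrects for gene length.)"""
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    return np.log1p(cpm)

def minmax_scale_genes(X):
    """Scale each gene (column) to [0, 1], as used before clustering."""
    X = np.asarray(X, dtype=float)
    mn = X.min(axis=0)
    rng = X.max(axis=0) - mn
    rng[rng == 0] = 1.0          # leave constant genes at 0
    return (X - mn) / rng
```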

B. GraphSCC Architecture
In this section, we introduce our proposed GraphSCC network. As shown in Fig. 1, the overall network consists of three components: a Denoising Autoencoder (DAE), a Graph Convolutional Network (GCN), and a Dual Self-Supervised Module (DSM). The DAE network is employed to obtain robust low-dimensional representations that can reconstruct the inputs. The GCN aims to optimize the representations with constraints from the structural information between cells, where the graph is initialized through the K-nearest neighbor (KNN) algorithm. The DSM aims to separate the cells according to the representations learned by DAE and GCN. Below, we introduce each component in detail.

DAE Networks
For a given set of single-cell samples, gene expression is represented by the matrix $X \in \mathbb{R}^{N \times d}$, with $N$ the number of single cells and $d$ the number of expressed genes. DAE is employed to encode the gene expression matrix into fixed-size vector representations. Concretely, at the $l$-th encoding layer the output is computed as

$$H_e^{(l)} = \phi\big(W_e^{(l)} H_e^{(l-1)} + b_e^{(l)}\big),$$

where $\phi$ is the activation function, and $W_e^{(l)}$ and $b_e^{(l)}$ are the weight matrix and bias parameters, respectively. The input of the first layer is the corrupted expression matrix $H_e^{(0)} = X + \epsilon$, where $\epsilon$ is the Gaussian noise. The decoding layers are computed as

$$H_d^{(l)} = \phi\big(W_d^{(l)} H_d^{(l-1)} + b_d^{(l)}\big),$$

where $\phi$ is the activation function, $W_d^{(l)}$ and $b_d^{(l)}$ are the weight matrix and bias parameters, respectively, and the output of the last layer is the reconstructed data $\hat{X}$. The encoder and decoder networks are both fully connected neural networks using the ReLU activation function. The weight and bias parameters are optimized based on the reconstruction loss

$$L_{res} = \frac{1}{2N}\,\|X - \hat{X}\|_F^2.$$
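A minimal NumPy sketch of this forward pass is given below. The actual model is trained in PyTorch; the layer sizes follow the d-512-256-64-10 setting reported in the Implementation section, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(fan_in, fan_out):
    return rng.normal(0, 0.01, (fan_in, fan_out)), np.zeros(fan_out)

class DenoisingAE:
    """Minimal DAE sketch: corrupt the input with Gaussian noise,
    encode to a bottleneck, decode back; training would minimize
    the reconstruction loss against the clean input."""
    def __init__(self, dims=(2000, 512, 256, 64, 10)):
        self.enc = [init_layer(a, b) for a, b in zip(dims[:-1], dims[1:])]
        self.dec = [init_layer(b, a) for a, b in zip(dims[:-1], dims[1:])][::-1]

    def forward(self, X, noise_sd=1.0):
        h = X + rng.normal(0, noise_sd, X.shape)   # corruption step
        for W, b in self.enc:
            h = relu(h @ W + b)
        z = h                                      # bottleneck embedding
        for W, b in self.dec:
            h = relu(h @ W + b)
        return z, h                                # embedding, reconstruction
```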

KNN Graph
The GCN graph was initialized by KNN, where the cell similarity is computed through the normalized dot product (cosine) function

$$S_{ij} = \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|},$$

where $x_i$ and $x_j$ are the embedded features for cells $i$ and $j$, and $\|x_i\|$ and $\|x_j\|$ are the corresponding moduli, respectively. According to the similarity matrix $S$, the most similar cells of each cell are selected as its neighbors to construct the adjacency matrix $A$ for GCN. In this study, the number of neighbors was set to at most 1% of the total number of cells, with a maximum of 20.
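The graph construction can be sketched as follows (a NumPy sketch using dense matrices for clarity; a real implementation on large datasets would use sparse structures):

```python
import numpy as np

def knn_adjacency(Z, max_k=20, frac=0.01):
    """Build a symmetric KNN adjacency from embedded features Z
    (cells x features) using cosine similarity;
    k = min(max_k, ceil(frac * n_cells))."""
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    k = min(max_k, max(1, int(np.ceil(frac * n))))
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                        # cosine similarity matrix
    np.fill_diagonal(S, -np.inf)         # exclude self from neighbors
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argpartition(-S[i], k)[:k]   # top-k most similar cells
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)            # symmetrize the graph
```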

GCN Network
The GCN network was constructed to capture the structural information between cells that is ignored by the DAE network. Inspired by a recent study [19], we use residual connections to alleviate the well-known over-smoothing phenomenon in GCN [13]. Due to the large feature dimension $d$ of $X$, we use a lower-dimensional matrix $H^{(0)} \in \mathbb{R}^{N \times m}$ as the initial representation of GCN, which is extracted from $X$ by a fully connected neural network as follows:

$$H^{(0)} = \phi(XW + b),$$

where $m$ is lower than $d$, and $W$ and $b$ are the weight matrix and bias parameters, respectively.
Then, based on the adjacency matrix $A$ obtained from KNN and the initial representation $H^{(0)}$, with the weight matrix $W^{(l)}$, the representation learned by GCN is obtained by the following convolutional operation:

$$H^{(l+1)} = \phi\Big(\big((1-\alpha)\tilde{P}H^{(l)} + \alpha H^{(0)}\big)\big((1-\beta_l)I + \beta_l W^{(l)}\big)\Big),$$

where the hyperparameter $\alpha > 0$ ensures that each layer retains information from the input layer, and $\beta_l$ ensures the decay of the weight matrix adaptively with stacked layers. Here, $\tilde{P} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalization term, with $\tilde{A} = A + I$, where $A$ is the adjacency matrix obtained from the KNN graph, $I$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. The $\phi$ is the ReLU activation function. We set the hyperparameters following the previous study [19], with $\lambda = 0.3$ in $\beta_l = \log(\lambda/l + 1)$. The last layer of the GCN module is connected to a softmax function:

$$Z = \mathrm{softmax}\big(\tilde{P}H^{(L)}W^{(L)}\big),$$

where the output $Z$ can be treated as a probability distribution over the $c$ clusters, and the entry $z_{ij}$ represents the probability that cell $i$ belongs to cluster center $j$.
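One propagation step of this formulation can be sketched in NumPy as below (the α and β values are placeholders, and the real model operates on PyTorch tensors):

```python
import numpy as np

def normalize_adj(A):
    """P~ = D~^{-1/2} (A + I) D~^{-1/2}: symmetric normalization of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, H0, P, W, alpha=0.1, beta=0.3):
    """One layer with an initial-residual connection (alpha) and an
    identity mapping on the weights (beta), GCNII-style [19]."""
    support = (1.0 - alpha) * (P @ H) + alpha * H0        # residual to H0
    out = support @ ((1.0 - beta) * np.eye(W.shape[0]) + beta * W)
    return np.maximum(out, 0.0)                           # ReLU
```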

Dual Self-Supervised Module
The dual self-supervised module was designed to guide the learning of low-dimensional representations by DAE and GCN for better cell clustering. Concretely, for the $i$-th single cell and the $j$-th cluster, the Student's t-distribution [20] is used as the kernel to calculate the similarity between the cluster center vector $\mu_j$ and the data representation $h_i$ as follows:

$$q_{ij} = \frac{\big(1 + \|h_i - \mu_j\|^2/\nu\big)^{-\frac{\nu+1}{2}}}{\sum_{j'}\big(1 + \|h_i - \mu_{j'}\|^2/\nu\big)^{-\frac{\nu+1}{2}}},$$

where $\nu$ is the degree of freedom of the Student's t-distribution (set to 1 in this study), and $q_{ij}$ can be treated as the probability of the $i$-th cell belonging to the $j$-th cluster. Based on the computed distribution $Q$, the target distribution $P$ is computed by

$$p_{ij} = \frac{q_{ij}^2/f_j}{\sum_{j'} q_{ij'}^2/f_{j'}},$$

where $f_j = \sum_i q_{ij}$ is the soft frequency for cluster $j$. For better clustering, a loss function is defined to minimize the Kullback-Leibler (KL) divergence between the two probability distributions:

$$L_{clu} = KL(P\|Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}}.$$

By minimizing the loss between the distributions $Q$ and $P$, the target distribution $P$ helps the DAE module learn better representations for clustering cells. Since the target distribution $P$ is defined by $Q$, minimizing $L_{clu}$ is a form of self-supervised learning.
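The soft assignment and its sharpened target can be sketched as follows (NumPy, with ν = 1; here `H` stands for the DAE embeddings and `centers` for the cluster centroids):

```python
import numpy as np

def soft_assign(H, centers, nu=1.0):
    """Student's t kernel similarity q_ij between embeddings and centers."""
    d2 = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    with f_j the soft cluster frequencies."""
    f = q.sum(axis=0)
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)
```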
To integrate the structural information between cells, we also used the target distribution $P$ to supervise the updating of the GCN distribution $Z$ as follows:

$$L_{gcn} = KL(P\|Z) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{z_{ij}}.$$

Because both the DAE and GCN modules are supervised by the target distribution $P$, we call this a dual self-supervised mechanism. Finally, the loss function of GraphSCC is defined as

$$L = L_{res} + \gamma L_{clu} + \eta L_{gcn},$$

where $\gamma$ and $\eta$ are two hyperparameters balancing the contributions from local structure preservation of the scRNA-seq data and the clustering optimization; the same $\gamma$ and $\eta$ values were used for all the datasets. Since the representations learned by the GCN contain both the structural information and the characteristic information learned from DAE, the soft assignments in distribution $Z$ are used as the final clustering results. Thus, the label assigned to cell $i$ is

$$y_i = \arg\max_j z_{ij},$$

where $z_{ij}$ is the probability of cell $i$ belonging to cluster $j$.
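Putting the pieces together, the total objective can be sketched as below; the γ and η defaults are placeholders, since the paper's exact values are not reproduced in this text:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(P || Q) summed over all cells and clusters."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

def graphscc_loss(X, X_hat, p, q, z, gamma=0.1, eta=0.01):
    """Total loss: reconstruction + gamma*KL(P||Q) + eta*KL(P||Z).
    gamma/eta here are illustrative, not the paper's settings."""
    l_res = 0.5 * np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return l_res + gamma * kl(p, q) + eta * kl(p, z)
```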

Implementation
The GraphSCC model was implemented in Python 3 using PyTorch. The dimensions of the DAE were set to d-512-256-64-10, where d is the dimension of the input data and 10 is the dimension of the bottleneck latent representation. The DAE is first pre-trained for 400 epochs with the Adam optimizer and a batch size of 32. We used the "randn" function in PyTorch to generate Gaussian noise. Note that, for simulated data, we reduced the noise to 0.2 times its value. We set the number of GCN layers to 5, and set the dimensions of the initial representation and the hidden layers of GCN to 256. The optimizer for the clustering stage is also Adam, and the clustering stage is trained for 1000 epochs. For all competing methods, we used the default parameters provided in the original articles.

III. RESULTS

A. Evaluation of GraphSCC
We first evaluated GraphSCC on simulated data. To investigate the performance of GraphSCC under different scenarios, we employed the Splatter R package to generate scRNA-seq data with different "sigma" values in the log-normal distribution. A greater sigma value represents larger distances between cells from different clusters and thus lower clustering difficulty. As shown in Fig. 2, GraphSCC consistently achieved the highest NMI values. Though Seurat3.0 and IDEC reach essentially the same NMI (~0.98) as GraphSCC at a sigma of 0.4, they drop sharply to 0.25 and 0.37, respectively, at a sigma of 0.2. By comparison, GraphSCC degrades much more gently, with an NMI of 0.76 at a sigma of 0.2. CIDR, scDeepCluster, and DCA performed poorly at a sigma of 0.2 with NMI below 0.1, but they achieved decent results at a sigma of 0.4 with NMI of 0.72, 0.77, and 0.9, respectively. SIMLR failed to detect clustering signals (NMI close to 0) when sigma is below 0.3, which is consistent with a previous observation [12].
We further investigated the clustering performance of GraphSCC on real scRNA-seq datasets from different species and tissues. As shown in Fig. 3, the clustering results of GraphSCC outperformed those of the competing methods in all evaluation metrics. On average, GraphSCC achieved 0.798, 0.791, and 0.744 for CA, NMI, and ARI, respectively. These are 13.3%, 9%, and 19% higher, respectively, than the values achieved by Seurat3.0, generally the second-best method. The scDeepCluster and SIMLR methods, in contrast to their poor performance on the simulated datasets, achieved NMI comparable to Seurat3.0 and ranked 4th and 5th. CIDR had an average NMI of 0.64. IDEC obtained the lowest NMI values, likely because the autoencoder module in IDEC cannot deal well with noisy scRNA-seq data. When measured by CA or ARI, these methods had generally similar ranks. The consistent outperformance of GraphSCC on both the simulated and real datasets indicates the robustness of our method.
IV. CONCLUSION

In this study, we integrated the structural relations between cells into scRNA-seq clustering through a graph convolutional network. The structural deep clustering model GraphSCC consists of GCN, DAE, and DSM modules. GraphSCC effectively captures the relations between cells and the characteristics of the data by learning representations with the GCN and DAE modules, respectively. Furthermore, the DSM clusters cells based on these representations by iteratively optimizing the clustering objective function in an unsupervised manner. We have demonstrated that the clustering performance of GraphSCC surpasses that of competing methods on both simulated and real datasets.