Identify gestational diabetes mellitus by deep learning model from cell-free DNA at the early gestation stage

Abstract Gestational diabetes mellitus (GDM) is a common complication of pregnancy with significant adverse effects on both the mother and the fetus. The incidence of GDM is increasing globally, and early diagnosis is critical for timely treatment and for reducing the risk of poor pregnancy outcomes. GDM is usually diagnosed and detected after 24 weeks of gestation, while complications due to GDM can occur much earlier. Copy number variations (CNVs) are a possible biomarker for GDM diagnosis and screening in the early gestation stage. In this study, we proposed a machine-learning method to screen for GDM in the early stage of gestation using cell-free DNA (cfDNA) sequencing data from maternal plasma. A total of 5085 pregnant women from northern regions of Mainland China, including 1942 GDM patients, were recruited. A non-overlapping sliding window method was applied for CNV coverage screening on low-coverage (~0.2×) sequencing data. The CNV coverage was fed to a convolutional neural network with an attention architecture for binary classification. The model achieved a classification accuracy of 88.14%, precision of 84.07%, recall of 93.04%, F1-score of 88.33% and AUC of 96.49%. The model identified 2190 genes associated with GDM, including DEFA1, DEFA3 and DEFB1. The enriched Gene Ontology (GO) terms and KEGG pathways showed that many of the identified genes are associated with diabetes-related pathways. Our study demonstrates the feasibility of using cfDNA sequencing data and machine-learning methods for the early diagnosis of GDM, which may aid in early intervention and the prevention of adverse pregnancy outcomes.


INTRODUCTION
Gestational diabetes mellitus (GDM) is a condition that affects a significant number of pregnant women. It is characterized by any degree of glucose intolerance that arises during pregnancy [1][2][3][4]. According to the International Diabetes Federation, approximately one in six pregnant women experience hyperglycemia, with 84% of these cases being attributed to GDM [4,5]. GDM tends to affect women at a younger age and has significant adverse effects on both the mother and the fetus [6]. Research has shown that pregnant women who develop GDM are at a higher risk of developing metabolic diseases, such as type II diabetes and obesity, later in life [1,5]. Furthermore, the adverse intrauterine environment brought about by GDM can permanently damage the fetus's organ function and structure, resulting in metabolic diseases related to growth and development, such as cardiovascular diseases and diabetes [7][8][9][10][11].
The International Association of Diabetes and Pregnancy Study Groups (IADPSG) recommended diagnostic criteria based on the results of the Hyperglycemia and Adverse Pregnancy Outcome (HAPO) study [5]. However, there is no consensus on the diagnostic criteria for early GDM due to insufficient data on diagnosing and testing GDM in early gestation [12,13]. Delayed diagnosis and the consequent delay in disorder management increase the risk of poor pregnancy outcomes, especially for high-risk women of advanced age or with obesity [14][15][16][17]. For example, studies suggest that fetal macrosomia, a prevalent complication associated with GDM, may develop as early as the 20th week of pregnancy [18,19].
Copy number variations (CNVs) are a possible biomarker for GDM diagnosis in the early gestation stage. Previous studies have reported CNVs related to pregnancy complications by sequencing the cell-free DNA (cfDNA) in maternal plasma [20,21]. One cfDNA study identified CNVs in three chromosomal regions related to a high risk of GDM. Gene enrichment analysis revealed that the genes in these regions include several alpha- and beta-defensin family members, such as DEFA1, DEFA3 and DEFB1, whose copy number variations are associated with type I and type II diabetes [22]. However, there are no existing methods for the early prediction of GDM based on CNV markers.
This study proposes a machine-learning method for early-stage GDM prediction using low-coverage genome-wide sequencing data of cfDNA in maternal plasma. A non-overlapping sliding window method with matrix imputation was applied for CNV coverage screening, and a deep neural network with an attention architecture was used for binary classification. The model achieved a classification accuracy of 88.14%, precision of 84.07%, recall of 93.04%, F1-score of 88.33% and AUC of 96.49%. We also identified 2190 genes associated with GDM in the early gestation stage, including DEFA1, DEFA3 and DEFB1. These genes were enriched in 327 GO terms and 18 KEGG pathways with an adjusted P-value lower than 0.05. Many of them have previously been associated with diabetes mellitus, including the glutamate receptor signaling pathway [23,24] and the type II diabetes mellitus pathway [25,26]. The study's findings suggest that the proposed machine-learning method can effectively identify early-stage GDM biomarkers and provide a basis for early diagnosis and management of the condition.

Sample collection and clinical PCA
A total of 5085 pregnant women from northern regions of Mainland China were retrospectively enrolled in this study, and blood samples were collected between 12 and 26 weeks of gestation. Clinical information on age and weight was collected and is listed in Supplementary Table 1. The majority of the samples (62.20%) were collected between 12 and 18 weeks of gestation. The diagnosis of gestational diabetes mellitus (GDM) followed the criteria recommended by the IADPSG. Pregnant women underwent a 75 g oral glucose tolerance test between 24 and 28 weeks of gestation. GDM was diagnosed if any of the following glucose thresholds was met or exceeded: fasting plasma glucose of 5.1 mmol/L, 1-h plasma glucose of 10.0 mmol/L or 2-h plasma glucose of 8.5 mmol/L. Accordingly, 1942 samples diagnosed with GDM based on these criteria were enrolled. In the subsequent analysis, the normal samples were labeled as 0, while the GDM samples were labeled as 1. Principal component analysis (PCA) was conducted using the clinical attributes as features. Samples with missing values were filtered out, and the values of the remaining sample features were normalized to a range of −1 to 1. The PCA was performed using the 'scikit-learn' package in Python. The study was approved by the Hospital Ethics Committee of the Beijing Obstetrics and Gynecology Hospital (approval no. 2018-KY-003-02), and all patients provided signed informed consent. The experiments were conducted in accordance with the guidelines of the National Health Commission of the People's Republic of China.
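The filtering, scaling and projection steps above can be sketched as follows. The toy values and the choice of `MinMaxScaler` are illustrative assumptions; the paper states only that features were normalized to −1 to 1 and that PCA was run with scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def clinical_pca(features):
    """Drop samples with missing values, scale features to [-1, 1],
    and project onto the first two principal components."""
    X = np.asarray(features, dtype=float)
    X = X[~np.isnan(X).any(axis=1)]          # filter out samples with missing values
    X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    return PCA(n_components=2).fit_transform(X)

# toy example: columns could be age, weight, gestational week at sampling
demo = np.array([[30, 60, 14], [36, 72, 16], [28, np.nan, 13], [41, 80, 18]])
coords = clinical_pca(demo)                   # one row per retained sample
```

The sample with the missing weight is dropped before scaling, so `coords` has one row fewer than the input.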

DNA extraction, library construction and genome sequencing
Five milliliters of peripheral venous blood was collected into ethylenediaminetetraacetic acid (EDTA) tubes from each participant. All blood samples were processed within 8 h by a double-centrifugation protocol. Blood samples were first centrifuged at 1600 × g for 10 min, and the supernatant was re-centrifuged at 16 000 × g for 10 min to remove residual cells. The resultant cell-free plasma was stored at −80°C. Then, 200 μL of maternal plasma was used for cfDNA extraction on a BGISP-300 instrument (BGI, Shenzhen, China) with a nucleic acid extraction kit (BGI, Shenzhen, China).
After DNA extraction, end-repair was carried out by adding end-repair enzymes under the following conditions: 37°C for 10 min and 65°C for 15 min, followed by adaptor ligation with label-adaptor and ligase at 23°C for 20 min. After end-repair and adaptor ligation, PCR was performed to amplify the DNA to the desired concentration under the following cycling conditions: 98°C for 2 min; then 12 cycles of 98°C for 15 s, 56°C for 15 s and 72°C for 30 s; with a final extension at 72°C for 5 min. The DNA amplification products were quantified on a Qubit® 2.0 fluorometer (Life Tech, Invitrogen, Carlsbad, CA, USA) using Qubit™ dsDNA HS Assay Kits (Life Tech, Invitrogen, USA), and a concentration ≥2 ng/μL was regarded as qualified. The volume was calculated according to the concentration of each sample, and equal masses of each sample were mixed by pooling. The fetal chromosome aneuploidy (T21, T18 and T13) detection kit (Combinatorial Probe-Anchor Synthesis Sequencing Method, CPAS) (BGI, Shenzhen, China) was used for library construction. After pooling, double-strand DNA was thermally denatured into single-strand DNA, followed by the addition of cyclization buffer and ligase to create DNA circles. Qualified DNA circles were used to make DNA Nanoballs (DNBs) by rolling-circle replication. The concentration of DNBs was quantified on the Qubit® 2.0 using Qubit™ ssDNA Assay Kits (Life Tech, Invitrogen, USA), and DNB concentrations within the range of 8-40 ng/μL were considered appropriate. The DNBs were then loaded onto chips and sequenced on a BGISEQ-500 platform (BGI, Shenzhen, China) in SE35 mode.
The average sequencing depth for each sample was approximately 0.2×, with at least 6 million unique sequencing reads obtained per sample for analysis.

Sequencing alignment and quality control
The sequencing data processing steps are shown in Figure 1A. We first aligned the cfDNA sequencing data to the human reference genome (hg38) with bwa-mem2 (https://github.com/bwa-mem2/bwa-mem2) [27]. Due to the low sequencing coverage, we filtered out reads falling outside the UCSC Mappability track wgEncodeCrgMapabilityAlign100mer.bigWig (https://genome-euro.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=mappability). The UCSC Mappability track indicates the genomic regions that can be accurately mapped or aligned by sequencing reads. To estimate the impact of complex genomic regions such as segmental duplications, we obtained the segmental duplication dataset from the UCSC Genome Browser and evaluated the distribution of base-pair sequencing depths within both segmental duplication and non-segmental duplication regions, as shown in Supplementary Figure 3.
Specifically, we observed that base pairs within segmentally duplicated regions tend to exhibit diminished read depth, suggesting that these regions pose challenges for read mapping, likely due to their intricate structural characteristics. Thus, we excluded these regions from our subsequent analyses to ensure the reliability of our conclusions. After removing duplicate reads with Picard (version 2.27.5) [28], samtools (version 1.9) [29] was used to calculate the sequencing coverage along the reference genome for each sample. The average sequencing coverage among the samples in this study was 0.19×. We filtered out samples with average coverage lower than 0.15×. We comprehensively evaluated quality control (QC) factors, including GC content, alignment rate, repeat rate and Q20, on the aligned data. The QC statistics for each sample can be found in Supplementary Table 1. A PCA was employed to assess the impact of the QC statistics on sample differentiation within the normal and GDM groups, verifying that the QC statistics did not introduce confounding effects when distinguishing between these groups.

Bias correction and coverage count
The genome reference was further divided by a non-overlapping sliding window method with different window sizes. The window sizes were set to 50, 20, 10 and 5 k separately. Before read counting, a bias correction procedure was implemented using HMMcopy to mitigate the influence of variations in GC content and mappability, both of which can affect read depth. To visually demonstrate this influence, scatter plots were generated for each window size: each data point corresponds to a sliding window, with the GC content (Supplementary Figure 2) or mappability value (Supplementary Figure 3) on the x-axis and the read depth on the y-axis. The mappability data were obtained from the National Center for Biotechnology Information (NCBI) mappability track of hg38 (accessible via the website). After bias normalization, GC content and mappability displayed similar distributions in the normal and GDM groups, indicating that the observed variability in read depth between the two groups is not primarily driven by differences in GC content or mappability. We counted the coverage along the reference for each window size by summing the coverage of the base pairs inside each window. We further visualized the scaled average coverage of normal samples, GDM samples and their diversity with scSVAS [30].
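As a minimal sketch of the coverage counting step, summing per-base depth inside non-overlapping windows can be written as follows (the depth values are toy data, and dropping the incomplete tail window is an illustrative simplification):

```python
import numpy as np

def window_coverage(per_base_depth, window_size):
    """Sum per-base-pair sequencing depth inside non-overlapping windows.
    The tail that does not fill a whole window is dropped."""
    depth = np.asarray(per_base_depth, dtype=float)
    n_windows = len(depth) // window_size
    trimmed = depth[:n_windows * window_size]
    return trimmed.reshape(n_windows, window_size).sum(axis=1)

# 10 bp of toy depth, windows of 5 bp
depths = np.array([1, 0, 2, 1, 1, 0, 0, 3, 1, 1])
cov = window_coverage(depths, 5)
```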

Feature selection, window pruning and coverage matrix imputation
We formed the window coverage matrix as shown in Figure 1B, with each row representing a sample and each column representing a window. The elements in the matrix are the window coverages. We normalized the elements row by row with the average coverage of each sample. To reduce the model complexity in the downstream binary classification task, we pruned the windows by the following rules: (1) windows in which over 80% of samples had a normalized coverage lower than 0.1 were dropped; (2) windows with similar average coverages in normal and GDM samples (a difference of less than 0.01) were dropped. To avoid the impact of read dropout (regions that cannot be sequenced) and sequencing errors due to low sequencing coverage, we further performed matrix imputation with SCOIT (https://github.com/deepomicslab/SCOIT) [31]. SCOIT has proven effective on single-cell data and can also be used for value imputation in matrices from other data sources. The imputed matrix was used for the subsequent machine-learning analysis.
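The row-wise normalization and the two pruning rules can be sketched as follows; the thresholds follow the text, while the toy matrix and labels are illustrative assumptions:

```python
import numpy as np

def prune_windows(cov, labels, low_frac=0.8, low_cov=0.1, min_diff=0.01):
    """cov: samples x windows coverage matrix; labels: 0 = normal, 1 = GDM.
    Normalize each row by its mean coverage, then drop (1) windows where
    more than `low_frac` of samples fall below `low_cov`, and (2) windows
    whose group means differ by less than `min_diff`."""
    cov = np.asarray(cov, dtype=float)
    labels = np.asarray(labels)
    norm = cov / cov.mean(axis=1, keepdims=True)       # row-wise normalization
    mostly_dropped = (norm < low_cov).mean(axis=0) > low_frac
    diff = np.abs(norm[labels == 0].mean(axis=0) - norm[labels == 1].mean(axis=0))
    keep = ~mostly_dropped & (diff >= min_diff)
    return norm[:, keep], keep

# toy matrix: window 2 is a dropout region, the others differ between groups
cov = [[1, 1, 0.0001, 2.0], [1, 1, 0.0001, 0.5],
       [1, 1, 0.0001, 2.0], [1, 1, 0.0001, 0.5]]
pruned, keep = prune_windows(cov, labels=[0, 1, 0, 1])
```

Only the dropout window (column 2) is removed here; the remaining columns pass both rules.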

Deep learning model and parameter optimization
Our model was designed to identify GDM cases in the early gestation stage based on low-coverage sequencing data of cfDNA. The coverage matrix generated by the non-overlapping sliding window was the input of the model. We built a CNN-based deep neural network with a self-attention layer to achieve binary classification, as shown in Figure 1C. In detail, the input layer passes the matrix to two 1D convolution layers, each followed by a maximum pooling layer. A self-attention layer then links the last convolution pooling layer to the dense layers. Two fully connected dense layers with a dropout of 0.4 were connected to the attention layer. Sigmoid was used as the activation function of the output layer to indicate whether the input sample came from a GDM patient. The deep learning model was constructed with the Keras package of TensorFlow [32]. The model's hyperparameters, along with their explanations, are presented in Supplementary Table 4. To optimize these hyperparameters, we conducted a grid search, systematically exploring different combinations within specified upper and lower bounds. The bounds and search steps for the hyperparameters are also detailed in Supplementary Table 4.
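The trained Keras architecture is specified in Supplementary Table 4; as an illustration of the self-attention step alone, a scaled dot-product attention over window features can be sketched in plain NumPy. The weight matrices below are random placeholders, not the trained parameters, and the dimensions are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n_windows, d) feature map from the last pooling layer.
    Returns the attended features and the attention weight matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))    # (n_windows, n_windows)
    return scores @ V, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                 # 6 windows, 4 features each
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of `attn` is a probability distribution over windows, which is what makes per-window importance scores recoverable after training.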

Training and evaluation
The dataset was randomly split into a training dataset (75%), a validation dataset (5%) and a testing dataset (20%). The model was trained using random weight initialization with a batch size of 512 for 100 epochs. The Adam optimizer was applied with a learning rate of 1e−5 and binary cross-entropy as the loss function. We adopted the dropout strategy to prevent overfitting. To evaluate the stability of our model, we conducted a 10-fold cross-validation on the dataset. During this process, we set the classification threshold to 0.5 by default, meaning that any prediction above this threshold was considered GDM.
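The 75%/5%/20% split can be sketched with scikit-learn as follows; the fixed seed and the use of `train_test_split` are assumptions, since the paper does not specify the splitting implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=42):
    """Random 75% / 5% / 20% split into train / validation / test."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.75, random_state=seed)
    # the remaining 25% is split 5:20, i.e. 20% of the rest for validation
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.2, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

X = np.arange(200).reshape(100, 2)          # 100 toy samples
y = np.arange(100) % 2
train, val, test = split_dataset(X, y)
```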

Benchmarking settings
We compared the performance of our model on our dataset with other machine-learning models, including the Random Forest (RF) model and the Support Vector Machine (SVM) model. The models were implemented with the Python scikit-learn package (version 1.20) [33] using the default parameters. We also conducted a comparative analysis between our method and a methodology that first identifies CNVs and then diagnoses GDM by detecting pathogenic CNVs.
To achieve this, we employed ichorCNA [34], a robust tool designed for estimating the tumor fraction in cell-free DNA through ultra-low-pass whole genome sequencing (ULP-WGS) at 0.1× coverage. In the absence of a normal panel, we performed CNV calling using window sizes of 10, 50, 500 and 1000 k, for which ichorCNA provides corresponding databases (gcWig and mapWig files). The command-line executions for readCounter and ichorCNA followed the default parameters outlined in the ichorCNA usage manual.

Performance evaluation
We used the confusion matrix on the validation dataset to measure the performance of the classification model. For a binary classification model, the confusion matrix is a 2 × 2 matrix with the ratios of true-positive, true-negative, false-positive and false-negative cases. From the confusion matrix we further calculated the accuracy (the percentage of samples that are correctly predicted), precision (the percentage of predicted positive cases that are correct), recall (the percentage of actual positive cases that are correctly predicted) and F1-score (the harmonic mean of precision and recall). We demonstrated the performance of our classification model at all classification thresholds with the receiver operating characteristic (ROC) curve. We calculated the area under the ROC curve (AUC) to provide an aggregated measure of model performance across all possible classification thresholds. We also applied direct bootstrapping by random sampling to estimate the uncertainty of the model. We performed 10 000 experiments for each window size with bootstrap sampling and calculated the average AUC with a confidence interval. For each experiment, we also measured the correlation coefficient to determine whether the model-predicted value is related to the true label. The P-value was calculated with the 'pearsonr' function in the Python package 'scipy'.
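The bootstrap AUC estimate and the Pearson correlation test can be sketched as follows. The labels and scores are toy data, and `n_boot` is reduced from the paper's 10 000 experiments for brevity:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import pearsonr

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Resample the test set with replacement and report the mean AUC
    with an empirical 95% confidence interval."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue                        # AUC needs both classes present
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    aucs = np.sort(aucs)
    return aucs.mean(), (aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot)])

# toy test set: well-separated scores for the two classes
y = np.array([0, 0, 1, 1, 0, 1, 0, 1] * 10)
p = np.clip(y * 0.7 + 0.15 + np.random.default_rng(1).normal(0, 0.1, len(y)), 0, 1)
mean_auc, ci = bootstrap_auc(y, p)
r, pval = pearsonr(y, p)                    # correlation between prediction and label
```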

Finding significant regions by self-attention layer
Our model trained the weights of the windows in the self-attention layer. The window weight indicates the positive contribution of the window to the binary classification of GDM. The model generated a 1 × N weight array in the training process, where N is the number of windows. We sorted the windows by their weights to obtain the top 5000 windows with the greatest discriminative power in the binary classification and marked the corresponding regions on the reference.
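Ranking windows by their learned weights is a simple argsort; the weights below are toy values and `k` is reduced from the paper's 5000:

```python
import numpy as np

def top_windows(weights, k=5000):
    """Return the indices of the k windows with the largest attention
    weights, ordered from most to least important."""
    weights = np.asarray(weights)
    order = np.argsort(weights)[::-1]       # descending by weight
    return order[:k]

w = np.array([0.1, 0.9, 0.3, 0.7])
idx = top_windows(w, k=2)                   # windows 1 and 3 carry the most weight
```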

Gene annotation and enrichment
The selected cfDNA window regions were annotated to genes in Ensembl release 108. Then, we performed Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses using the R package 'clusterProfiler'. The Benjamini-Hochberg correction (FDR) was used to calculate adjusted P-values. An adjusted P-value less than 0.05 was considered significant.
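For illustration, the Benjamini-Hochberg adjustment applied by clusterProfiler can be re-implemented in a few lines; this NumPy sketch is not the clusterProfiler code itself, and the P-values are toy inputs:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment: scale each sorted P-value by
    n / rank, then enforce monotonicity from the largest P-value down."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

adj = benjamini_hochberg([0.001, 0.02, 0.03, 0.5])
significant = adj < 0.05                    # terms passing the 0.05 threshold
```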

Similar clinical characteristics of normal and GDM groups
Of the 5085 pregnant women, 1942 were diagnosed with GDM. The rose diagrams showed that the distributions of samples across all three clinical attributes, i.e. age, weight and sampling time, were similar in the normal and GDM groups (Figure 2A-C). More than half of the samples in both the normal and GDM groups were obtained from patients over 35 years old. The GDM group had a higher proportion of patients of advanced maternal age compared to the normal group. Among the clinical attributes, patient weight had the most missing values. Among the known values, most patients weighed between 50 and 70 kg. Regarding sampling time, the majority of samples were collected between 12 and 18 weeks of gestation in both the normal (61.60%) and GDM (62.10%) groups. Given the common perception that advanced maternal age and higher body weight predispose pregnant women to a greater risk of developing GDM, our objective in conducting the PCA was to confirm the absence of any bias related to clinical information within our dataset. The PCA results revealed a mingling of samples from both groups, signifying that the two groups exhibited comparable clinical characteristics (Figure 2D).

Copy number diversity at different window sizes
After generating the coverage matrix with the number of aligned reads in each window, we performed sample filtering and window pruning on the coverage matrix. Six hundred and fifteen samples (194 GDM samples and 421 normal samples) were filtered out because their sequencing coverage was lower than 0.15×. The non-overlapping sliding window method initially generated 617 689, 308 861, 154 447 and 61 799 windows for the window sizes of 5, 10, 20 and 50 k, respectively. After the window pruning process following the rules described in the Methods, 20 637, 11 288, 4520 and 2114 windows remained for the respective window sizes. The processed coverage matrices were then used for matrix imputation and binary classification. Figure 3 shows the normalized heatmap of the coverage diversities between GDM and normal samples at the whole-genome level for different window sizes. The GDM group showed a greater magnitude of copy number alterations compared to the normal group. Specifically, more copy number loss events on chromosomes 1-9 and more copy number gain events on chromosomes 12-19, 20 and X were observed in the GDM group.

Optimal window size and the performance of the model
The model utilized 3353 (75%) samples (1326 GDM samples and 2027 normal samples) for training, 223 (5%) samples (80 GDM samples and 143 normal samples) for validation and 894 (20%) samples (342 GDM samples and 552 normal samples) for testing. The learning and loss curves for different window sizes are shown in Figure 4. We can observe that the model performance depends on the window size, as smaller windows provide higher resolution in screening CNVs. As illustrated in Figure 4A-D, the model started to converge at around epochs 10-40. When the window size was 10 k, the model provided the best validation accuracy, around 86%. The model tended to overfit on the dataset generated with window size 5 k, as smaller window sizes introduce more candidate windows as features. The training loss shown in Figure 4E-H is consistent with the learning curves.
To assess the stability of our model, we conducted 10-fold cross-validation experiments using different window sizes. The results are presented in Figure 5A-D, which shows the ROC curves along with the corresponding AUC values. In each subfigure, the black line represents the ROC curve generated by the test dataset, while the gray lines represent the ROCs obtained from cross-validation. We also annotated the correlation and P-values between the model's predictions and the true values in the test dataset for each subfigure. Notably, all P-values were below 0.05, indicating a strong and statistically significant correlation between the predictions and the true values. Furthermore, the ROC curves align consistently with the learning curve derived from the validation data, demonstrating that our model achieves optimal performance when the window size is set to 10 k. The model performed best at window size 10 k with an accuracy of 88.14%, precision of 84.07%, recall of 93.04%, F1-score of 88.33% and AUC of 96.49%. Table 1 shows the detailed confusion matrix and assessment metrics.
We also performed bootstrapping to evaluate the robustness of the model with 10 000 experiments at each window size. Figure 5E and F demonstrates the performance of our model with random sampling of the testing data. The boxplot in Figure 5E shows the AUCs at different window sizes. The mean AUCs for window sizes 50, 20, 10 and 5 k were 0.92, 0.94, 0.96 and 0.93, respectively. The window size of 10 k provided the best AUC and the smallest variation among the experiments. Figure 5F shows the distribution of AUCs at window size 10 k among the experiments. The density forms a normal distribution with [0.952, 0.968] as the 95% confidence interval.
We also compared the performance of our model with other machine-learning models, including SVM and RF, as shown in Table 1. All of the results are statistically significant. We further compared our method with the methodology involving the identification of CNVs. We incorporated experiments with ichorCNA using CNV screening window sizes of 10, 50, 500 and 1000 k, conducting separate screenings for each of these window sizes. Regrettably, when utilizing a 10 k window size, ichorCNA encountered an error and failed to produce valid outcomes, consistently displaying the error message 'zero-width neighborhood.make span bigger.' across all samples. In contrast, for the other window sizes, ichorCNA yielded meaningful integer-based CNV results.
Before applying our deep learning model to the dataset, we executed a feature pruning procedure to reduce the number of CNV windows. This process adhered to two criteria: (1) the exclusion of windows where over 90% of samples (across both normal and GDM groups) shared the same CNV value; and (2) the elimination of windows where the disparity in average CNV values between the normal and GDM groups was below 0.1. This culling procedure led to the retention of 3846, 595 and 8 valid windows for the 50, 500 and 1000 k window sizes, respectively. The 1000 k window size was subsequently disregarded due to an insufficient number of viable windows, while the 50 and 500 k sizes were used for our DL model. Supplementary Table 5 presents the performance metrics of the DL model on the ichorCNA-called CNV dataset, and Supplementary Figure 3 portrays the learning curve. The 50 k window size provided the best results, with an accuracy of 69.84%, precision of 67.69%, recall of 72.13%, F1-score of 69.84% and AUC score of 79.89%.

cfDNA fragments covered diabetes-associated genes
We incorporated a self-attention layer into the CNN-based model to measure the importance of the input windows. The self-attention layer generated a weight array indicating each window's discrimination power in the binary classification of GDM. Here, we utilized the results obtained with window size 10 k, as it provided the best performance. We sorted the windows by weight and extracted the top 5000 windows for further analysis. As shown in Figure 3, the observed differences in coverage highlight the potential utility of these regions for accurate classification and further emphasize their relevance in understanding the characteristics associated with GDM. We annotated 2190 genes in the window regions (Supplementary Table 2). The previously discovered alpha- and beta-defensin genes, DEFA1, DEFA3 and DEFB1 [22], also occurred in our gene list.
We performed GO and KEGG enrichment analyses on the annotated genes. The genes covered by cfDNA fragments were enriched in 327 GO terms and 18 KEGG pathways (Padj < 0.05; Figure 6, Tables 2-3 and Supplementary Table 3). Among the GO terms, the glutamate receptor signaling pathway (GO:0007215; adjusted P-value = 2.34e−5) and the ionotropic glutamate receptor signaling pathway (GO:0035235; adjusted P-value = 1.52e−5) are associated with type I and type II diabetes [23,24]. Studies have shown that in diabetes the glucoregulatory mechanism is partially lost due to decreased Ca2+ channel activity and gene expression in alpha cells [35,36]. In addition, a complete glucagon response to hypoglycemia relies on positive autocrine feedback mediated by extracellular glutamate, which acts on ionotropic glutamate receptors of the AMPA/kainate type [37]. Besides, forebrain development (GO:0030900; adjusted P-value = 4.84e−7) and forebrain cell migration (GO:0021885; adjusted P-value = 2.92e−4) are descendant nodes of brain development in the GO term tree. Changes in brain signaling systems have been reported to be essential in the etiology and pathogenesis of type II diabetes mellitus and metabolic syndrome [38]. GTPase regulator activity (GO:0030695; adjusted P-value = 3.55e−8) participates in post-translational modifications that have also been linked to diabetes mellitus [39]. Genes were enriched in the type II diabetes mellitus pathway (hsa04930; adjusted P-value = 1.76e−4), consistent with the reported correlation between type II diabetes mellitus and GDM [25,26]. Consistent with the GO term enrichment analysis, we can observe the glutamatergic synapse pathway (hsa04724; adjusted P-value = 8.07e−4), which involves glutamate localization inside presynaptic vesicles and glutamate receptors on the postsynaptic membrane. Other enriched KEGG pathways known to be related to type II diabetes mellitus include adherens junction (hsa04520; adjusted P-value = 8.82e−3), the MAPK signaling pathway (hsa04010; adjusted P-value = 8.82e−3),
pancreatic secretion (hsa04972; adjusted P-value = 2.13e−3), platelet activation (hsa04611; adjusted P-value = 2.54e−2), the AMPK signaling pathway (hsa04152; adjusted P-value = 3.28e−2) and insulin resistance (hsa04931; adjusted P-value = 3.33e−2) [40][41][42]. The adherens junction plays a role in the pathogenesis of type II diabetes mellitus through repercussions for the endocrine pancreas, the intestinal barrier and kidney dysfunction [40].

DISCUSSION
GDM is a prevalent pregnancy complication that can have detrimental effects on maternal and fetal health. However, GDM is usually diagnosed after 24 weeks of gestation, which is a delayed detection. Therefore, it is necessary to develop consensus diagnostic criteria for GDM in the early gestational stage. To address this issue, this study proposed a convolutional neural network (CNN)-based deep learning model that uses low-coverage sequencing data of cell-free DNA (cfDNA) samples to classify GDM as early as 12 weeks of gestation. To construct the comparison baseline for our GDM classification model, CNVs needed to be called from the aligned data. For this purpose, we employed the widely recognized CNV calling software tailored for low-depth data, ichorCNA. The obtained count of valid windows was limited for all selected window sizes. The window size of 50 k yielded the most promising outcomes with our deep learning model, with an accuracy of 69.84%, precision of 67.69%, recall of 72.13%, F1-score of 69.84% and an AUC score of 79.89%. Subsequently, we adopted a non-overlapping sliding window methodology that aggregated the coverages of base pairs within each window from the aligned data. This procedure generated a window coverage matrix, which served as the input for our CNN-based deep neural network. The DL model achieved high accuracy, with an AUC, accuracy, precision, recall and F1-score of 96.49, 88.14, 84.07, 93.04 and 88.33%, respectively. The proposed model demonstrated high stability, evidenced by the mean AUC of 0.96 from bootstrapping with 10 000 experiments. Compared to identifying CNVs from extremely low-coverage samples, our method achieves higher-resolution results in cases where non-integer CNV signals are present within the CNV segments, leading to enhanced sensitivity and accuracy in distinguishing between normal and GDM samples. We suggest that the proposed model can be useful in the clinical screening of GDM, potentially leading to better health outcomes for both mothers and fetuses.
It is also important to interpret the biological mechanism underlying our model. Previous research has shed light on the segmental nature of cfDNA footprints, which exhibit sequence bias toward particular genomic regions. The relatively minor diversity observed in both the GDM and normal groups could potentially be attributed to such segmental DNA originating from maternal conditions. To delve deeper, we assessed the discrimination scores of specific genomic windows. These scores indicate the model's ability to distinguish GDM patients from normal patients. By focusing on regions that are either frequently present or notably absent in GDM patients, we increase the likelihood of identifying biologically significant regions. Furthermore, our analysis included enrichment studies, exploring the GO terms and KEGG pathways associated with these highlighted regions. These steps are essential for understanding the functional implications of the identified regions and their potential roles in gestational diabetes mellitus (GDM). We believe that our proposed model can be easily adapted to other diseases with low-coverage sequencing data.
Despite demonstrating high accuracy in classifying GDM in the early gestation stage, this study has limitations. The model was tested only on our own dataset; data from other sources and from different demographic groups, collected under a sequencing protocol similar to ours, should be included to strengthen the credibility of the model. In addition, the CNV genes identified in this study were not verified experimentally, owing to the difficulty of following up GDM cases; further studies are required to validate the identified genes experimentally. Regions frequently present or absent in GDM patients provide higher discrimination power in the binary classification task and are thereby more likely to be identified. Validating these findings requires high-coverage sequencing data derived from maternal cfDNA; obtaining such data is a critical next step for a more comprehensive and conclusive assessment in future work.
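The paper does not give the exact formula used for the per-window discrimination scores; a minimal sketch of the idea, using a two-sample effect size (Cohen's d, a hypothetical stand-in for the attention-derived score) over the coverage matrix, might look like:

```python
import numpy as np

def window_discrimination_scores(cov_gdm, cov_norm):
    """Score each genomic window by how well its coverage separates
    GDM from normal samples (Cohen's d effect size; illustrative
    metric, not the one used in the paper).

    cov_gdm:  (n_gdm, n_windows) coverage matrix, GDM samples
    cov_norm: (n_norm, n_windows) coverage matrix, normal samples
    """
    m1, m2 = cov_gdm.mean(axis=0), cov_norm.mean(axis=0)
    v1, v2 = cov_gdm.var(axis=0, ddof=1), cov_norm.var(axis=0, ddof=1)
    pooled_sd = np.sqrt((v1 + v2) / 2) + 1e-12  # avoid division by zero
    return np.abs(m1 - m2) / pooled_sd

# Synthetic data: window 7 is systematically gained in the GDM group.
rng = np.random.default_rng(0)
gdm = rng.normal(1.0, 0.1, size=(50, 100))
normal = rng.normal(1.0, 0.1, size=(60, 100))
gdm[:, 7] += 0.5
scores = window_discrimination_scores(gdm, normal)
print(int(scores.argmax()))  # → 7
```

Windows with the highest scores would then be mapped back to genomic coordinates and annotated with overlapping genes for the enrichment analysis.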
Nonetheless, the study highlights the effectiveness of deep learning and convolutional neural networks in scanning CNVs with low-coverage sequencing data for GDM classification in the early gestation stage.

Key Points
• We proposed a convolutional neural network (CNN) model for classifying gestational diabetes mellitus (GDM) from 5085 individuals with low-coverage cfDNA sequencing data in the early gestation stage. Among the recruited individuals, 1942 were diagnosed with GDM. The model achieved an accuracy of 88.14%, precision of 84.07%, recall of 93.04%, F1-score of 88.33% and AUC of 96.49%.
• The sequencing coverage matrix generated by non-overlapping sliding-window screening provides candidate copy number variation (CNV) regions in low-coverage sequencing data.
• The self-attention architecture revealed potential GDM-associated genes and CNV regions. Gene Ontology (GO) term and KEGG enrichment analysis of the potential genes identified diabetes-related pathways.
• Our machine-learning model showed high efficiency in predicting GDM in the early pregnancy stage, which may aid in early intervention for GDM and prevention of adverse pregnancy outcomes.
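The non-overlapping sliding-window screening in the first key point can be sketched as simple binning of aligned read positions. This is an illustrative reduction (working from in-memory read start positions rather than BAM files, and omitting GC-bias correction and normalization that a real pipeline would apply):

```python
import numpy as np

def coverage_matrix(read_starts_per_sample, chrom_len, window=10_000):
    """Bin aligned read start positions into non-overlapping windows,
    yielding one row of per-window read counts per sample."""
    n_win = chrom_len // window
    mat = np.zeros((len(read_starts_per_sample), n_win), dtype=int)
    for i, starts in enumerate(read_starts_per_sample):
        idx = np.asarray(starts, dtype=int) // window
        idx = idx[idx < n_win]        # drop reads beyond the last full window
        np.add.at(mat[i], idx, 1)     # unbuffered per-window counting
    return mat

# Two toy samples on a 30 kb chromosome with 10 kb windows.
samples = [[100, 150, 10_500], [5, 25_000]]
m = coverage_matrix(samples, chrom_len=30_000, window=10_000)
print(m.tolist())  # → [[2, 1, 0], [1, 0, 1]]
```

The resulting samples-by-windows count matrix is what the CNN consumes, with each column corresponding to one candidate CNV region.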

Figure 1 .
Figure 1. An overview of the GDM classification model. (A) The pipeline for processing sequencing data to generate the count matrix for the deep learning model. (B) An example of the count matrix construction. (C) The deep learning architecture of our model.

Figure 2 .
Figure 2. Clinical information of the collected samples. (A) The age distribution of the normal group and the GDM group. (B) The weight distribution of the normal group and the GDM group. (C) The sampling time distribution of the normal and GDM groups. (D) PCA scatter plot of normal samples and GDM samples.

Figure 3 .
Figure 3. Copy number diversity at different window sizes along the whole chromosome.

Figure 4 .
Figure 4. Model learning performance at different window sizes. (A-D) Learning accuracy curves for window sizes of 50, 20, 10 and 5 kb, respectively. (E-H) Learning loss curves for window sizes of 50, 20, 10 and 5 kb, respectively.

Figure 5 .
Figure 5. Model performance on the test dataset. (A-D) ROC curves for window sizes of 50, 20, 10 and 5 kb, respectively. The black line represents the ROC curve generated from the test dataset, while the gray lines represent the ROCs obtained from cross-validation. (E) The AUC distribution for the 10 kb window size over 10 000 bootstrapping experiments. (F) The AUCs of various window sizes in the bootstrapping experiments.

Figure 6 .
Figure 6. The results of GO term and KEGG pathway enrichment of the annotated genes. (A) Enriched GO terms with adjusted P-value smaller than 0.05. (B) Enriched KEGG pathways with adjusted P-value smaller than 0.05.
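The adjusted P-values reported in Figure 6 come from standard over-representation testing. A minimal, pure-Python sketch of the underlying one-sided hypergeometric test (with hypothetical gene counts; real analyses would use an annotation database and multiple-testing correction):

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P-value for observing >= k genes from a pathway of size K among
    n selected genes, out of N background genes (one-sided
    hypergeometric test, the basis of GO/KEGG over-representation)."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Hypothetical counts: 30 of 200 model-identified genes fall in a
# pathway containing 300 of 20 000 background genes
# (expected overlap under the null is only 200 * 300 / 20 000 = 3).
p = hypergeom_enrichment_p(k=30, n=200, K=300, N=20_000)
print(p < 0.05)  # → True
```

In practice the per-pathway P-values would then be adjusted (e.g. Benjamini-Hochberg) before applying the 0.05 threshold shown in the figure.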

Table 1 :
Details on confusion matrix and performance measurement metrics of three machine-learning models with different window sizes

Table 2 :
List of the top 10 GO terms in each ontology

Table 3 :
List of enriched KEGG pathways