DNA Repair Biomarker for Lung Cancer Risk and its Correlation With Airway Cells Gene Expression.

Background
Improving lung cancer risk assessment is required because current early-detection screening criteria miss most cases. We therefore examined the utility for lung cancer risk assessment of a DNA Repair score obtained from OGG1, MPG, and APE1 blood tests. In addition, we examined the relationship between the level of DNA repair and global gene expression.


Methods
We conducted a blinded case-control study with 150 non-small cell lung cancer case patients and 143 control individuals. DNA Repair activity was measured in peripheral blood mononuclear cells, and the transcriptome of nasal and bronchial cells was determined by RNA sequencing. A combined DNA Repair score was formed using logistic regression, and its correlation with disease was assessed using cross-validation; correlation of expression to DNA Repair was analyzed using Gene Ontology enrichment.


Results
DNA Repair score was lower in case patients than in control individuals, regardless of the case's disease stage. Individuals at the lowest tertile of DNA Repair score had an increased risk of lung cancer compared to individuals at the highest tertile, with an odds ratio (OR) of 7.2 (95% confidence interval [CI] = 3.0 to 17.5; P < .001), and independent of smoking. Receiver operating characteristic analysis yielded an area under the curve  of 0.89 (95% CI = 0.82 to 0.93). Remarkably, low DNA Repair score correlated with a broad upregulation of gene expression of immune pathways in patients but not in control individuals.


Conclusions
The DNA Repair score, previously shown to be a lung cancer risk factor in the Israeli population, was validated in this independent study as a mechanism-based cancer risk biomarker and can substantially improve current lung cancer risk prediction, assisting prevention and early detection by computed tomography scanning.

length substrate into a defined shorter oligonucleotide, and the ratio between the two represents the DNA repair activity. The assays were performed on a robotic platform (Tecan Freedom EVO 200), and the reaction products analyzed by capillary gel electrophoresis, using the ABI3130XL genetic analyzer (Applied Biosystems), and the GeneMapper (Applied Biosystems) and PeakAnalyzer (Robiotec, Rehovot, Israel) software.

Bronchial and nasal sample collection
Bronchial brushings. During diagnostic bronchoscopy procedures three bronchial brushings, designed to gently remove epithelial cells with minimal bleeding, were performed using bronchial brushes (Olympus Medical, Southend, UK). Brushings using disposable cytology brushes (BC-202D-5010 Olympus Japan) were taken from geographically different areas of macroscopically uninvolved main bronchus or lobar bronchi contralateral to the suspected lesion.
Nasal curette samples. Samples of nasal airway epithelium were taken under direct vision from the inferior part of the inferior turbinate of each nostril using nasal curettes (ASI Rhino-Pro; Arlington Scientific Inc.).

RNAseq
Tissue samples from bronchial brushings and nasal curettes were stored in 500µl RNALater overnight at 4 o C, and then at -80 o C for longer-term storage. RNA was extracted using Qiagen MiRNeasy columns according to manufacturer's protocols. Briefly, bronchial brushes were rinsed in PBS, brushes transferred into 700µl Qiazol and cells lysed by vortexing twice for 30 seconds.
For nasal samples the RNALater containing nasal tissue (500µl) was diluted with 2ml of PBS and spun at 10,000 rpm for 10 min. The cell pellet was lysed by resuspension in 700µl Qiazol. For both types of samples, the Qiazol lysate was applied to a QiaShredder tube (#217004) and spun at 13,000 rpm for 2 mins. The homogenate was kept at room temperature for 5 mins, followed by chloroform extraction using PhaseLock tubes. Nucleic acids in the aqueous phase were precipitated using 1.5 volumes 100% ethanol and DNA was digested using DNAse I. Finally, RNA was isolated from the mixture using RNAeasy mini spin columns. RNA was quantified using a Qbit measurement and quality assessed using an Agilent Bioanalyzer. For samples with a RIN greater than 7, a total of 500ng of RNA was used for Illumina TruSeq Library generation.
Sequencing was carried on a HiSeq 2500 Illumina sequencers. Sequencing was carried out in two separate multiplexed experiments. Alignment was carried out on the human genome version GRCh37 using the Tophat alignment tool. On average each library contained above 20 Million reads. Count matrices for cases and controls were processed using DESeq2 (5).

Analysis of RNAseq data
As described in the manuscript, analyzing the relationship between the DNA repair score and gene expression, as determined by RNAseq, uncovered that low DNA repair score correlates with upregulation of immune system pathways in lung cancer patients, but not in control subjects. The following sections describe the methods and results of quality control (QC) procedures, data cleaning and statistical analyses implemented in the main manuscript. The RNAseq dataset included read counts from 669 nasal and bronchial samples derived from 490 subjects, out of which DNA repair score values were available for 213 subjects. The 669 samples' RNAseq dataset (sequencing batches: 494 samples in experiment 1 and 175 in experiment 2) was used in its entirety for the QC analysis.

Quality control analysis of the RNAseq data
Number of detected genes. Gene transcripts were defined as detected if it had counts of more than 10 reads. Genes with <10 reads were filtered out. Samples with less than 13,000 detected genes were filtered out from the analysis. 97.8% of the samples analyzed in experiment 1 and 98.3% in experiment 2 had >13,000 detected genes.
Experimental batches. We used Principal Component Analysis (PCA) to detect the major sources of variation in the data. As expected, tissue type -nasal (NS) versus bronchial (BR), explains most of the variance followed by sequencing batches (experiment 1 versus experiment 2; Supplementary Fig. 1). We did find a few samples that seem to reside in the wrong clusters and removed them from further analysis.
Gender effect. A good quality RNAseq dataset should enable identifying gender differences in gene expression. Stratifying by sequencing batch and tissue type we observed the effect of gender in the PCA of ~100 most variable genes. Several mismatches that were found were removed from further analysis (not shown).

Differential expression analysis
Analysis was performed on the combined dataset obtained from the two experimental batches (Experiments 1 & 2). Similar results were obtained when only the bigger dataset of Experiment 1 was used. The RNAseq data was regressed on DNA repair scores using DESeq2 (5), a regression tool optimized for RNAseq data. Analysis was performed separately on the different tissues (nasal/bronchial) and disease state (cases / controls), with experimental batch, age, gender, smoking status (never, former and current smokers) and cancer histology (in cases) as adjusting factors. With a False Discovery Rate (FDR) threshold of 0.01, we could find very few genes whose expression correlated with the DNA repair score, as follows: case bronchial samples, 0; case nasal samples, 8; control nasal samples, 1; Nevertheless, it is notable that in the cases group (but not in control subjects) there is an enrichment of genes whose expression increases with decreasing DNA repair OMA score values (left, negative values part of the Volcano plot in Supplementary Fig. 2D). Hypothesizing that the correlation signal might be distributed over many genes, with each gene having a small effect size, we employed gene set enrichment analysis (GSEA; (6)), testing for pathways enriched with genes that are correlated with the DNA repair OMA score.

Gene set enrichment analysis
The list of genes, ranked by their statistics (as reported in DESeq2) was analyzed by GSEA (GSEA 3.0) in order to identify whether there is an over-representation of genes belonging to specific pathways (annotated by Gene Ontology; GO terms; pathway Gene Ontology downloaded from MSigDB (6),c5.all.v6.1). Supplementary Table 2 lists the thirty most significant pathways that were identified. For each pathway, the enrichment algorithm finds the maximum enrichment score, reflecting the degree to which the genes in the set are over-represented at either the top (positive correlation) or bottom (negative correlation) of the list, and calculates the FDR q-value (the false discovery rate), which is the estimated probability that the enrichment score represents a false positive finding. The pathways were manually curated and divided into 3 groups: Immune system-related pathways, Cell Cycle pathways and Other pathways (see legend to Supplementary Table   3). Supplementary Table 3 summarize the pathways that were found to be significantly enriched in nasal samples by GSEA (q-value<0.001, a very strict value as explained in ref (6)), showing a strong negative association of Immune system-related pathways in the cases group, with essentially no signal in the control groups (see also Fig. 5 in the manuscript). Another set of pathways that exhibited negative correlation with the OMA score represents cell cycle pathways, which unlike the Immune system-related pathways seem to be enriched in both cases and controls (Supplementary Table 3).
To visualize the differences in the correlations between DNA repair score and immunesystem pathways, versus DNA repair score and 'Other' pathways, we highlight in differential expression volcano plots two pathways, selected for being relative big and with roughly similar size (~350 genes): the inflammatory response pathway (GO_INFLAMMATORY_RESPONSE, which is an immune system pathway), and the skeletal system development pathway (GO_SKELETAL_SYSTEM_DEVELOPMENT, which belongs to 'Other pathways'). Simulations to test the robustness of the correlation between a low DNA repair OMA score and activity of immune system pathways Extreme OMA score trimming analysis. In this section we repeated the analysis for the nasal tissue samples sequenced in experiment 1, except that we excluded samples with OMA scores at the tails of the OMA distribution, removing 3.5% tail from each side of the OMA scores.
The effect of extreme trimming is presented in Supplementary Fig. 3a, showing the upregulation of the immune pathways also with the trimmed OMA score (compare to Fig. 5

in manuscript).
Sub-sampling analysis. To get an estimate for the robustness of the results to a more general sampling noise, we conducted 100 iterations of random sub-sampling of subjects and repeated the regression in each iteration. The RNAseq data of the selected random groups of subjects (at 80% of the sample size) were regressed on OMA scores, followed by gene set enrichment analysis, and the number of significant immune system-related pathways (at a q-value<0.001) was determined. Supplementary Fig. 3b shows that 95% of the simulations have more than 117 significant immune system-related pathways (with median value of 137). This analysis is an indication that the results are not sensitive to sampling noise.

Methods and Tools for RNAseq analysis
All Statistical analysis was conducted in R version 3.2.1 (7). All figures were generated with ggplot2 package (8). Data normalization and regression analysis was done with DESeq2 (5).

Calculation of 5-year risk of lung cancer
The basis of the calculation was the Liverpool Lung Project (LLP) risk model (9). The paper describes a linear logistic regression model for the probability of developing lung cancer within 5 years that depends on several factors: age, sex, smoking duration, prior diagnosis of pneumonia, occupational exposure to asbestos, prior diagnosis of malignant tumor, and family history of lung cancer. To illustrate the effect of the DNA repair score (OMA score) on the risk of lung cancer we did the following: (a) We chose the profile of a male or female aged 65y who had one of the following smoking histories: never smoked, smoked for 10 years, smoked for 30 years or smoked for 50 years, and who had none of the other risk factors in the LP model (i.e. no prior diagnosis of pneumonia, no occupational exposure to asbestos, no prior diagnosis of a malignant tumor and no family history of lung cancer). (b) We assumed that the distribution of OMA DNA repair scores was independent of the risk factors in the LLP model. This is supported by data from the current and previous studies that have shown that the OMA score has small statistically non-significant correlations with age, sex and smoking history. (c) We assumed also that none of the risk factors in the LLP model modify the effect of OMA score on lung cancer risk. This is also supported by data from the current and previous studies that have shown small statistically non-significant interactions between OMA and age, sex and smoking history. (d) Under these assumptions we adapted the LLP model to include the OMA score as an extra factor. The betacoefficient for the OMA score in this model was log(2.5), where 2.5 was the cross-validated odds ratio estimate for the DNA repair score (see Table 3 of the main paper). For a 65-year old male with the above-mentioned profile, the modified model was: logit (P) = -5.56 + beta-smok -log(2.5) × (OMA -3.553). (1) In this equation, P is the probability of lung cancer diagnosis within the next 5 years, the value of -5.56 is taken from Table A1 of (9), the value of beta-smok is 0, 0.769, 1.452 or 2.507 respectively for never-smoked, or smoked for 10y, 30y or 50y (taken from Table 2 of (9)), and the value of 3.553 was calculated by us so as to yield an average risk in our control group equal to the average risk in the Liverpool population of males aged 65y in the years 2002-4 (see Table   A1 of (9)).
The model for a 65-year old female was that given in Equation (A1) except that -5.56 was replaced by -5.99 (see Table A1 (9)) and 3.553 was replaced by 3.555 (our calculation).

(e) Equation (1) enables the 5-year lung cancer risk to be calculated for a person resident in
Liverpool with one of our profiles and a specific value of the OMA score (to be entered into the equation). To calculate the average risk for persons with that same profile but with OMA scores below or above a given percentile (5 th , 10 th or 75 th , as given in Supplementary Table 4), we used numerical integration over the distribution of OMA scores, assuming the DNA repair score had a normal distribution with mean value of 4.00 (see the control group mean in Table 1 in the main manuscript) and standard deviation 0.98 (the control group's SD).

Supplementary Tables
Supplementary Table 1 Supplementary Figure 3. Analysis of the robustness of the correlation between low DNA repair OMA score and upregulation of immune-system related pathways. A. Effect of extreme OMA score trimming on the enrichment of immune system pathways with low DNA repair score using gene set enrichment analysis (GSEA). Trimming was performed separately for nasal samples from cases and controls by removing samples with DNA repair OMA values in the 3.5% extreme OMA values from both sides. Genes from the trimmed sub-groups were ranked by the RNAseq2 analysis according to their correlation to the DNA repair score, and analyzed by GSEA to identify pathways (using GO terms) which significantly correlate (negative or positive correlation) with the DNA repair score. The figure represents all Immune system related pathways (depicted by the list of keywords presented in Supplementary Table 3 Figure 4. Comparison of association of low DNA repair score with lung cancer in the UK and Israeli (IL) studies. Results of logistic regression, in which odds ratios were estimated for the continuous DNA repair score variable and categorized into 3 groups. IL, results taken from the Israeli study (2); UK, results were taken from Table 3 in the manuscript.