Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
April 23, 2020 1/16
Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning Alejandro Lopez-Rincon1, Alberto Tonda2, Lucero Mendoza-Maldonado3, Daphne
G.J.C. Mulders4, Richard Molenkamp4, Eric Claassen5, Johan Garssen1,6, Aletta D. Kraneveld1 1. Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science,
Utrecht University, Universiteitsweg 99, 3584 CG Utrecht, the Netherlands 2. UMR 518 MIA-Paris, INRAE, c/o 113 rue Nationale, 75103, Paris, France 3. Hospital Civil de Guadalajara ”Dr. Juan I. Menchaca”. Salvador Quevedo y Zubieta 750,
Independencia Oriente, C.P. 44340 Guadalajara, Jalisco, Mexico 4. Department of Viroscience, Erasmus Medical Center, Rotterdam, the Netherlands 5. Athena Institute, Vrije Universiteit, De Boelelaan 1085, 1081 HV Amsterdam, the Netherlands. 6. Department Immunology, Danone Nutricia research, Uppsalalaan 12, 3584 CT Utrecht, the
Netherlands (Submitted: 23 April 2020 – Published online: 27 April 2020)
DISCLAIMER This paper was submitted to the Bulletin of the World Health Organization and was posted to the COVID-19 open site, according to the protocol for public health emergencies for international concern as described in Vasee Moorthy et al. (http://dx.doi.org/10.2471/BLT.20.251561). The information herein is available for unrestricted use, distribution and reproduction in any medium, provided that the original work is properly cited as indicated by the Creative Commons Attribution 3.0 Intergovernmental Organizations licence (CC BY IGO 3.0).
RECOMMENDED CITATION
Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Mulders D.G.J.C., Molenkamp R, Claassen E, et al. Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning. [Preprint]. Bull World Health Organ. E-pub: 27 April 2020. doi: http://dx.doi.org/10.2471/BLT.20.261842
April 23, 2020 2/16
Abstract
Objective
Deep learning techniques can deliver remarkable results in biology, being able to flawlessly perform complex tasks such as differentiating virus strains of the same family. Nevertheless, most of the models created by these algorithms are black boxes, and their decision process is impervious to human interpretation. In this paper, we apply techniques of explainable AI to the task of discovering representative genomic sequences in SARS-CoV-2, to finally generate specific primers.
Methods
Starting from a convolutional neural network trained on 553 sequences from the 2019 Novel Coronavirus Resource database, we distinguish the genome of virus strains from the Coronavirus family with considerable accuracy (> 98%). Next, we analyze the network’s behavior on every sample, to discover sequences used by the model to classify SARS-CoV-2. Then, using feature selection algorithms we find sequences exclusive to SARS-CoV-2. Finally, using the identified sequences we develop a SARS-CoV-2 specific primer set and test it using a conventional PCR.
Findings
A first validation, performed on 583 samples from the NGDC repository and 20,604 from the NCBI repository, show that we can identify SARS-CoV-2 from more than 900 other viruses with a high classification accuracy (> 99%) using only 12 sequences of 21 base pairs. Then, we compute the frequency of appearance of these 12 sequences in 9,294 samples from the GISAID repository, observing frequencies ranging from 95.24% to 99.73%. Finally, we use one of the sequences as a forward primer, generating a primert. Testing the primer set using a conventional PCR delivers a sensibility similar to routine diagnostic methods, and 100% specificity when comparing to other coronaviruses and differentiating between SARS-CoV-2 positive patients (n=5) and controls (n=3).
Conclusion
Our methodology, combining deep learning, viromics and primer design,to develop accurate detection of SARS-CoV-2 by means of qPCR proved to be effective. This approach of detection using cDNA or DNA sequences can be applied to a range of different problems, like mutations in cancer, and autoimmune diseases. Finally, considering the possibility of future pandemics this technology will be suitable to fast and accurately create methods for diagnostics to combat the spread.
Introduction 1
The Coronaviridae family presents a positive sense, single-strand RNA genome. These 2
viruses have been identified in avian and mammal hosts, including humans. 3
Coronaviruses have genomes from 26.4 kilo base-pairs (kbps) to 31.7 kbps, with G + C 4
contents varying from 32% to 43%; human-infecting coronaviruses belonging to this 5
family include SARS-CoV, MERS-CoV, HCoV-OC43, HCoV-229E, HCoV-NL63 and 6
HCoV-HKU1 [1]. In December 2019, SARS-CoV-2, a novel, human-infecting 7
Coronavirus was identified in Wuhan, China, using Next Generation Sequencing 8
(NGS) [2]. 9
As a typical RNA virus, new mutations appears every replication cycle of 10
Coronavirus, and its average evolutionary rate is roughly 10-4 nucleotide substitutions 11
April 23, 2020 3/16
per site each year [2]. In the specific case of SARS-CoV-2, RT-qPCR testing using 12
primers in ORF1ab and N genes have been used to identified the infection in humans [3]. 13
However, this method presents a high false negative rate (FNR), with a detection rate 14
of 30-50% [4]. This low detection rate can be explained by the variation of viral RNA 15
sequences within virus species, and the viral load in different anatomic sites [5]. 16
Population mutation frequency of site 8,872 located in ORF1ab gene and site 28,144 17
located in ORF8 gene gradually increased from 0 to 29% as the epidemic progressed [6]. 18
As of March 27th of 2020, the new SARS-CoV-2 has 462,684 confirmed cases across 19
almost all countries, with 250,287 cases in the European region [7]. In addition, 20
SARS-CoV-2 has an estimated mortality rate of 3-4%, and it is spreading faster than 21
SARS-CoV and MERS-CoV [8]. SARS-CoV-2 assays can yield false positives if they are 22
not targeted specifically to SARS-CoV-2, as the virus is closely related to other 23
Coronavirus organisms. In addition, SARS-CoV-2 may present with other respiratory 24
infections, which make it even more difficult to identify [9, 10]. 25
Thus, it is fundamental to improve existing diagnostic tools to contain the spread. 26
For example, diagnostic tools combining computed tomography (CT) scans with deep 27
learning have been proposed, achieving an improved detection accuracy of 82.9% [11]. 28
Another solution for identifying SARS-CoV-2 is additional sequencing of the viral 29
complementary DNA (cDNA). We can use sequencing data with cDNA, resulting from 30
the PCR of the original viral RNA; e,g, Real-Time PCR amplicons (Fig. 1) to identify 31
the SARS-CoV-2 [12]. 32
Classification using viral sequencing techniques is mainly based on alignment 33
methods such as FASTA [13] and BLAST [14]. These methods rely on the assumption 34
that DNA sequences share common features, and their order prevails among different 35
sequences [15, 16]. However, these methods suffer from the necessity of needing base 36
sequences for the detection [17]. Nevertheless, it is necessary to develop innovative 37
improved diagnostic tools that target the genome to improve the identification of 38
Fig 1. PCR Amplicons sequencing procedure.
pathogenic variants, as sometimes several tests, are needed to have an accurate 39
diagnosis. As an alternative deep learning methods have been suggested for 40
classification of DNA sequences, as these methods do not need pre-selected features to 41
identify or classify DNA sequences. Deep Learning has been efficiently used for 42
classification of DNA sequences, using one-hot label encoding and Convolution Neural 43
Networks (CNN) [18, 19], albeit the examples in literature are featuring DNA sequences 44
of length up to 500 bps, only. 45
In particular, for the case of viruses, NGS genomic samples might not be identified 46
by BLAST, as there are no reference sequences valid for all genomes, as viruses have 47
April 23, 2020 4/16
high mutation frequency [20]. Alternative solutions based on deep learning have been 48
proposed to classify viruses, by dividing sequences into pieces of fixed length, ranging 49
from 300 bps [20] to 3,000 bps [21]. However, this approach has the negative effect of 50
potentially ignoring part of the information contained in the input sequence, that is 51
disregarded if it cannot completely fill a piece of fixed size. 52
Given the impact of the world-wide outbreak, international efforts have been made 53
to simplify the access to viral genomic data and metadata through international 54
repositories, such as The 2019 Novel Coronavirus Resource (2019nCoVR) repository [6], 55
the National Center for Biotechnology Information (NCBI) repository [22] and the 56
Global Initiative on Sharing All Influenza Data (GISAID) repository [23], expecting 57
that the easiness to acquire information would make it possible to develop medical 58
countermeasures to control the disease worldwide, as it happened in similar cases 59
earlier [24–26]. Thus, taking advantage of the available information of international 60
resources without any political and/or economic borders, we propose an innovative 61
system based on viral gene sequencing. 62
Starting from a CNN trained to separate Coronavirus samples belonging to different 63
strains [27], including SARS-CoV-2, we apply techniques inspired by explainable AI in 64
computer vision to discover representative cDNA sequences that the network uses to 65
classify SARS-CoV-2. We then validate the discovered sequences on datasets not used 66
during the training of the CNN, and show how to exploit them to create a novel, highly 67
informative feature space. Experimental results show that the new feature space leads 68
traditional, simple classifiers, to correctly assess SARS-CoV-2 with remarkable accuracy 69
(> 99%). A few of the discovered sequences also possess the correct characteristics for 70
potentially becoming primers, as just checking for their presence in samples is enough to 71
identify SARS-CoV-2. 72
Materials and methods 73
The CNN used during all the experiments is composed of one convolutional layer with 74
12 filters (each with window size 21) with maxpooling (with pool size and stride 148), a 75
fully connected layer (196 rectified linear units with dropout probability 0.5), and a 76
final softmax layer with 5 units, to differentiate the different classes of Coronavirus 77
strains. The optimized used is Adaptive Momentum (ADAM) [28], with learning rate 78
10−5 and a batch size of 50 samples, run for 1,000 epochs [27]. 79
The convolutional layer of the network, in simple terms, is analyzing subsequences of 80
21 base pairs that can appear in different points of the virus genome. We selected 21 as 81
primers have a length of 18-22 bps normally. The pool size of the maxpooling represents 82
the interval in which a specific 21-bps sequence can be recognized (in this case, 148 83
positions). Through the training process, the convolutional layer is de-facto learning 84
new features to characterize the problem, directly from the data. In this specific case, 85
the new features are 21-bps sequences that can more easily separate different virus 86
strains (Fig. 2). By analyzing the result of each filter in a convolutional layer, and how 87
its output interacts with the corresponding max pooling, it is possible to detect 88
human-readable sequences of base pairs that might provide domain experts with 89
relevant information. It is important to notice that these sequences are not bound to 90
specific locations of the genome; thanks to its structure, the CNN is able to detect them 91
and recognize their importance even if their position is displaced in different samples. 92
April 23, 2020 5/16
Fig 2. Overall procedure to find the specific SARS-CoV-2 sequences to create a primer set.
The trained CNN described above obtained a mean accuracy of 98.75% in a 10-fold 93
cross-validation [27]. Once the network is trained, in a first analysis, we plot the inputs 94
and outputs of the convolutional layer, to visually inspect for patterns. As an example, 95
in Fig. 3 we report the visualization of the first 1,250 bps of each of the 553 samples 96
from the NGDC [6] repository (Table 1). 97
Each filter slides a 21-bps window over the input, and for each step produces a single 98
value. The output of a filter is thus a sequence of values in (0, 1): Fig. 4 shows the 99
output of the first filter in the CNN, for the first 1,250 bps of all 553 samples, mapped 100
to a monochromatic image where the closest a value is to 1, the whiter the 101
corresponding pixel is. 102
Table 1. Organism, assigned label, and number of samples in the unique sequences obtained from the repository [6]. We use the NCBI organism naming convention [29].
Organism Label Number of Samples
SARS-CoV-2 0 66 MERS-CoV 1 236 HCoV-OC43 2 136 HCoV-229E 2 22 HCoV-EMC 2 6 HCoV-4408 2 2 HCoV-NL63 3 58 HCoV-HKU1 3 17 SARS-CoV 4 7 SARS-CoV P2 4 1 SARS-CoV HKU-39849 4 1 SARS-CoV GDH-BJH01 4 1
April 23, 2020 6/16
Sequence 0-1250 bps Fig 3. cDNA visualization for the first 1,250 bps from the input dataset, for each of the 553 samples. Each sample is represented by a horizontal line of pixels. Colored pixels represent bases: G=green, C=blue, A=red, T=orange, missing=black. The data is separated by class (Table 1).
Sequence 0-1250 Fig 4. The output of convolutional filter 0, for the input given in Fig. 3. The output of the filters is a series of continuous values in (0, 1), here represented in grayscale, with higher values closer to white.
April 23, 2020 7/16
The output of the max pooling for each filter is then further inspected for patterns. 103
An example of the output of the max pooling for the first filter is displayed in Fig. 5: It 104
is noticeable how samples belonging to different classes can be already visually 105
distinguished. At this step, we identify filter 0 as the most promising, as it seems to 106
focus on a few relevant points in the genome, that could correspond to meaningful 107
cDNA sequences. 108
Fig 5. Visualization of the output of the max pooling for the first filter of the CNN, with the data visualized in Fig. 4 in input. Different patterns for samples from different classes are recognizable from a simple visual inspection. Here samples from the same class appear grouped, one after the other, for clarity.
Given this data, it is now possible to identify the 21-bps sequences that obtained the 109
highest output values in the max pooling layer of filter 0, in a section of 148 positions. 110
This process results in 210 (31,029 divided by 148) max pooling features, each one 111
identifying the 21-bps sequence that obtained the highest value from the convolutional 112
filter, in a specific 148-position interval of the original genome: The first max pooling 113
feature will cover positions 1-148, the second will cover position 149-296, and so on. We 114
graph the whole set of max pooling features for the complete data 4,410 (210*21), Fig. 6. 115
Analyzing the different sequence values appearing in the max pooling feature space, 116
April 23, 2020 8/16
Sequence 0 - 2,205
Sequence 2,205 - 4,410 Fig 6. cDNA visualization for the selected 210 21-bps-long sequences selected from the input dataset. Each sample is represented by a horizontal line of pixels. Colored pixels represent bases: G=green, C=blue, A=red, T=orange, missing=black. We divide the whole information, for visualization purposes; from visual inspection we can see the similarity of the patterns between the classes.
a total of 3,827 unique 21-bps cDNA sequences, that can potentially be very informative 117
for identifying different virus strains. For example, sequence AGG TAA CAA ACC 118
AAC CAA CTT is only found inside the class of SARS-CoV-2, in 59 out of 66 119
available samples. Sequence CAC GAG TAA CTC GTC TAT CTT is present 120
again only in SARS-CoV-2, in 63 out of the 66 samples. 121
The combination of the convolutional and max pooling layer allows the CNN to 122
identify sequences even if they are slightly displaced in the genome (by up to 148 123
positions). As some samples might present sequences that are displaced even more, in 124
the next experiments we decided to just consider the relative frequency of the 21-pbs 125
sequences identified at the previous step, creating a sequence feature space, to verify 126
whether the appearance of specific sequences at any point in the genomecould be 127
enough to differentiate between virus strains. 128
Identifying SARS-CoV-2 129
Experiment 1: Validation on the NGDC dataset 130
We downloaded the dataset from the National Genomic Data Center (NGDC) 131
repository [6] on March 15th 2020. We removed repeated sequences and applied the 132
whole procedure to translate the data into the sequence feature space. This leaves us 133
with a frequency table of 3,827 features with 583 samples (Table 2). Next, we ran a 134
state-of-the-art feature selection algorithm [30], to reduce the sequences needed to 135
identify different virus strain to the bare minimum. Remarkably, we are then able to 136
correctly classify all samples using only 53 of the original 3,827 sequences, obtaining a 137
100% accuracy in a 10-fold cross-validation with a simpler and more traditional 138
classifier, such as Logistic Regression. 139
April 23, 2020 9/16
Table 2. Organism, assigned label, and number of samples in the unique sequences obtained from the repository [6]. We use the NCBI organism naming convention [29].
Organism Label Number of Samples
SARS-CoV-2 0 96 MERS-CoV 1 236 HCoV-OC43 2 136 HCoV-229E 2 22 HCoV-EMC 2 6 HCoV-4408 2 2 HCoV-NL63 3 58 HCoV-HKU1 3 17 SARS-CoV 4 7 SARS-CoV P2 4 1 SARS-CoV HKU-39849 4 1 SARS-CoV GDH-BJH01 4 1
Experiment 2: Validation on the NCBI dataset 140
We downloaded data from NCBI [22] on March 15th 2020, with the following query: 141
gene=“ORF1ab” AND host=“homo sapiens” AND “complete genome”. The query 142
resulted in 407 non-repeated sequences (Table 3), with 68 sequences belonging to 143
SARS-CoV-2. Then, we applied the whole procedure to translate the data into the 144
sequence feature space, and we run the same state-of-the-art feature selection 145
algorithm [30]. The result is a list of 10 different sequences (Table 4), for which just 146
checking for their presence is enough to differentiate between SARS-CoV-2 and other 147
viruses in the dataset, with a 100% accuracy. Each of the sequences, in fact, only 148
appears in SARS-CoV-2 samples. 149
Table 3. Organism, assigned label, and number of samples in the unique sequences obtained from the repository NCBI [29].
Virus Label Number of Samples
SARS-CoV-2 0 68 MERS-CoV 1 180 HCoV-OC43 1 105 HCoV-NL63 1 29 HCoV-HKU1 1 13 HCoV-4408 1 2 HCoV-229E 1 3 HCoV-EMC 1 3 HAstV-VA1 1 1 HAstV-BF34 1 1 HMO-A 1 1 HAstV-SG 1 1
Experiment 3: Further validation on the NCBI dataset 150
We downloaded data from NCBI [22] on March 17th 2020, with the following query: 151
“virus” AND host=“homo sapiens” AND “complete genome”, restricting the size from 152
1,000 to 35,000 bps. The query returns 20,603 samples, of which only 32 belong to 153
SARS-CoV-2, and 20,571 are from other taxa, including Hepatitis B, Dengue, Human 154
immunodeficiency, Human orthopneumovirus, Enterovirus A, Hepacivirus C, 155
Chikungunya virus, Zaire ebolavirus, Human respirovirus 3, Orthohepevirus A, 156
Norovirus GII, Hepatitis delta virus, Mumps rubulavirus, Enterovirus D, Zika virus, 157
Measles morbillivirus, Enterovirus C, Human T-cell leukemia virus type I, Yellow fever 158
virus, Adeno-associated virus, rhinovirus (A, B and C), for a total of more than 900 159
April 23, 2020 10/16
Table 4. Sequences that only exist in SARS-CoV-2, that help differentiate between the virus and other taxa as displayed in Table 3.
TAG CAC TCT CCA AGG GTG TTC CAT CTA CTG ATT GGA CTA GCT AAT GAA TTA TCA AGT TAA TGG CAC GTA GGA ATG TGG CAA CTT TGA GCA GTG CTG ACT CAA CTC CAA CTT TTA ACG TAC CAA TGG CTA AAG CAT ACA ATG TAA CAC GAT GGT CAA GTA GAC TTA TTT TGC CAC TTG GCT ATG TAA CAC TAT TAG TGA TAT GTA CGA CCC
viruses. Then, we we applied the whole procedure to translate the data into the 160
sequence feature space and run the feature reduction algorithm [30]. This results in 2 161
sequences of 21 bps: just by checking for their presence, we are able to separate 162
SARS-CoV-2 from the rest of the samples with a 100% accuracy. The sequences are: 163
AAT AGA AGA ATT ATT CTA TTC and CGA TAA CAA CTT CTG 164
TGG CCC. 165
Experiment 4: Validation on the GISAID dataset 166
From the GISAID repository [23], we downloaded 9,294 sequences available on April 167
16th, for SARS-CoV-2, from different countries, from there 8,256 are non-repeated, 168
complete and host=”homo sapiens”. Then, we calculated the frequency table of the 169
21-bps sequences obtained from experiments 2 and 3, to verify which sequences remain 170
and could be used for detection. The appearance frequency of the target sequences 171
among the samples in the GISAID dataset is reported in Table 5. From the 8,256 172
sequences 87.13% have the 12 sequences, and all of them have at least 5 or more. 173
Table 5. Frequency of appearance (in percentage) for the sequences discovered in experiments 2 and 3, among the 8,256 samples from GISAID [23].
Genomic Sequence Frequency
CAC GTA GGA ATG TGG CAA CTT 99.73% TAT TAG TGA TAT GTA CGA CCC 99.60% AAT GAA TTA TCA AGT TAA TGG 99.55% AAT AGA AGA ATT ATT CTA TTC 99.54% CAA CTT TTA ACG TAC CAA TGG 99.38% CTA AAG CAT ACA ATG TAA CAC 99.38% TAG CAC TCT CCA AGG GTG TTC 99.29% CGA TAA CAA CTT CTG TGG CCC 98.84% TGC CAC TTG GCT ATG TAA CAC 97.57% CAT CTA CTG ATT GGA CTA GCT 97.40% TGA GCA GTG CTG ACT CAA CTC 96.06% GAT GGT CAA GTA GAC TTA TTT 95.24%
Experiment 5: Laboratory validation of the candidate primer set. 174
Viral RNA was isolated from cell-cultured SARS-CoV-2, SARS-1, MERS-CoV, 175
hCoV-NL63, hCoV-OC43, hCoV-229E, and from nasopharyngeal swabs from patients 176
by MagNA Pure LC (Roche Diagnostics, The Netherlands) using the total nucleic acid 177
April 23, 2020 11/16
isolation kit. The RNA was converted into cDNA using SuperscriptIII (Thermo-Fisher 178
Scientific, USA) and random hexamers. Subsequently, conventional PCR was performed 179
on the cDNA using HotStar Taq DNA polymerase (Qiagen, The Netherlands) with 180
400nM forward primer (5’-AG CAC TCT CCA AGG GTG TTC-3’) and 400nM 181
reverse primer (5’-GCA AAG CCA AAG CCT CAT TA- 3’) and the following 182
cycling conditions: 15 min at 950C, followed by 40 cycles of 1 min. at 950C , 1 min. at 183
520C and 1 min. at 720C. The PCR products were visualized by electrophoresis. The 184
same RNA was used in a diagnostics reference assay (Corman et al. [3]) and the Cycle 185
threshold values form this reference assay were used for estimating sensitivity. 186
Results and Discussion 187
Identifying SARS-CoV-2 188
Summarizing the results of experiments 1-4, we discovered 12 meaningful 21-bps 189
sequences that best characterize SARS-CoV-2. For all the analyzed data, these 190
sequences appear only in SARS-CoV-2 samples and not in any other viruses, as 191
summarized in Table 6. 192
Table 6. Percentage of appearance for each of the 12 discovered 21-bps sequences across the different datasets, and comparison to similar viruses in nature and other hosts.
Source GISAID NCBI NCBI NGDC NGDC GISAID GISAID GISAID Virus SARS-CoV-2 Other Taxa SARS-CoV-2 Other Taxa SARS-CoV-2 Betacoronavirus SARS-related coronavirus betacoronavirus Host Homo Sapiens Homo Sapiens Homo Sapiens Homo Sapiens Homo Sapiens Manis javanica Rhinolophus affinis Canine # Samples 8,256 20,572 32 487 96 9 1 1 CAC GTA GGA ATG TGG CAA CTT 99.73% 0.00% 100.00% 0.00% 97.92% 0.00% 100.00% 100.00% TAT TAG TGA TAT GTA CGA CCC 99.60% 0.00% 100.00% 0.00% 97.92% 0.00% 0.00% 100.00% AAT GAA TTA TCA AGT TAA TGG 99.55% 0.00% 100.00% 0.00% 96.88% 100.00% 0.00% 100.00% AAT AGA AGA ATT ATT CTA TTC 99.54% 0.00% 100.00% 0.00% 96.88% 0.00% 100.00% 100.00% CAA CTT TTA ACG TAC CAA TGG 99.38% 0.00% 100.00% 0.00% 97.92% 0.00% 0.00% 100.00% CTA AAG CAT ACA ATG TAA CAC 99.38% 0.00% 100.00% 0.00% 100.00% 0.00% 0.00% 100.00% TAG CAC TCT CCA AGG GTG TTC 99.29% 0.00% 100.00% 0.00% 97.92% 0.00% 0.00% 100.00% CGA TAA CAA CTT CTG TGG CCC 98.84% 0.00% 100.00% 0.00% 97.92% 0.00% 100.00% 0.00% TGC CAC TTG GCT ATG TAA CAC 97.57% 0.00% 100.00% 0.00% 97.92% 0.00% 100.00% 100.00% CAT CTA CTG ATT GGA CTA GCT 97.40% 0.00% 100.00% 0.00% 97.92% 0.00% 100.00% 100.00% TGA GCA GTG CTG ACT CAA CTC 96.06% 0.00% 100.00% 0.00% 98.96% 0.00% 0.00% 100.00% GAT GGT CAA GTA GAC TTA TTT 95.24% 0.00% 100.00% 0.00% 96.88% 0.00% 0.00% 100.00%
After the analysis carried out on the deep learning model, we remark that the 193
discovered sequence TAG CAC TCT CCA AGG GTG TTC in particular, 194
exclusive to SARS-CoV-2, shows a frequency of appearance of 99.29% in viral genomes 195
available from different countries in GISAID [23] and 100.0% in the NCBI [22] dataset. 196
Using sample NCBI NC045512.2 as the reference SARS-CoV-2 sequence, we identify 197
that this discovered sequence is located between nucleotides 25,604 and 25,624 in the 198
ORF3a gene. In SARS-CoV, this gene encodes a protein of 274 aa, that is related with 199
necrotic cell death [31], chemokine production like interleukin 8 (IL-8) and 200
RANTES/CCL5, NFκB activation resulting in an inflammatory response [32] and may 201
play an important role in the virus life cycle [33]. We design a specific primer set for 202
detection of SARS-CoV-2 using Primer3plus [34]. We use TAG CAC TCT CCA 203
AGG GTG TTC as forward primer and GCA AAG CCA AAG CCT CAT TA 204
as reverse primer, obtaining an amplicon size of 178 bps. Then, we run an in-silico PCR 205
test using FastPCR 6.7 [35] with default parameters, this yields the results reported in 206
Fig. 7. 207
Laboratory validation of the candidate primer set 208
To validate the data obtained in silico by laboratory methods a conventional PCR was 209
performed on RNA from SARS-CoV-2 and other human coronaviruses. In addition, 210
RNAs from nasopharyngeal swabs from six patients previously diagnosed with 211
SARS-CoV-2 infection and four patients negative for SARS-CoV-2 by routine diagnostic 212
April 23, 2020 12/16
Fig 7. In-silico PCR test results from the FastPCR 6.7 software [35] using sequences TAG CAC TCT CCA AGG GTG TTC and GCA AAG CCA AAG CCT CAT TA as primers, in NC045512.2 used as a reference SARS-CoV-2 sequence.
April 23, 2020 13/16
method [3] were analyzed with the same conventional PCR (Fig. 8). Different dilutions 213
of SARS-CoV-2 RNA were detected with similar sensitivity compared to the diagnostic 214
reference assay. (Fig. 8 lanes 1-8). Our candidate primer set exclusively detected 215
SARS-CoV-2 and did not amplify RNA from other human coronaviruses (Figure 9, lanes 216
9-14). The candidate primer set was able to detect SARS-CoV-2 RNA from patient 217
samples previously found positive for SARS-CoV-2, but not in patients previously found 218
negative (Fig. 8, lanes 15-24). Although further validation will be required to develop 219
this candidate primer set into a diagnostic assay, our results clearly demonstrate the 220
power of our method to select potential sequences for further validation. 221
Fig 8. Laboratory validation of the candidate primer set by conventional PCR. MM, molecular weight marker; Lanes 1-8, 10-fold dilutions of SARS-CoV-2 RNA (corresponding to Ct values 26 to 39 in the diagnostic reference assay); Lanes 9-14, RNA from different human coronaviruses (hCoV-OC43, hCoV-229E, hCoV-NL63, MERS-CoV, SARS-1, SARS-CoV-2 respectively); Lanes 15, 16, 17, 19, 20, 21, patient samples previously found positive for SARS-CoV-2; Lanes 18, 22, 23, 24, patient samples previously found negative for SARS-CoV-2
April 23, 2020 14/16
Conclusions 222
Being able to reliably identify SARS-CoV-2 and distinguish it from other similar 223
pathogens is important to contain its spread. The time of processing samples and the 224
availability of reliable diagnostic tests is a challenge during an outbreak. Developing 225
innovative diagnostic tools that target the genome to improve the identification of 226
pathogens, can help reduce health costs and time to identify the infection, instead of 227
using unsuitable treatments or testing. Moreover, it is necessary to perform an accurate 228
classification to identify the different species of Coronavirus, the genetic variants that 229
could appear in the future, and the co-infections with other pathogens. 230
Given the high transmissibility of the SARS-CoV-2, the proper diagnosis of the 231
disease is urgent, to stop the virus from spreading further. Considering the false 232
negatives given by the standard nucleic acid detection, better implementations such as 233
using deep learning are necessary in order to properly detect the virus. While the 234
accuracy of current nucleic acid testing is around 30-50%, and CT scans with deep 235
learning go up at 83%, we believe that the use of the sequences detected by a 236
CNN-based system has the potential to improve the accuracy of the diagnosis above 237
95%. 238
Our results, show that by targeting only 12 21-bps specific sequences, we are able to 239
distinguish SARS-CoV-2, from any other virus (> 99%). Further testing is necessary to 240
confirm these promising results so it is essential to create multidisciplinary groups that 241
work to stop the outbreak. Finally, as an interesting remark, by comparing the 242
discovered sequences against other hosts, we noticed that from the 12 sequences 243
exclusive to SARS-CoV-2, one of them appears in all of the 9 samples from Manis 244
Javanina. In contrast, 5 of the sequences of SARS-CoV-2 appear in the only sample 245
available from Rhinolophus Affinis and 11 out of 12 in a Canine sample. This is 246
consistent with the findings of Zhang et al. [36, 37], and could point to the zootonic 247
origin of the virus. Nevertheless, more data is necessary. 248
As a result of the high density populations, and ever growing interaction between 249
people, it is possible that other pandemics may occur. Thus, thinking forward, our 250
methodology can be applied in future viral pandemics to speed up the development of 251
accurate detection methods for diagnosis and thereby contribute to limit the spread of a 252
virus. 253
Ethical approval 254
The study was approved by the Medical Ethical Commission of the Erasmus MC 255
(MEC-2015-306). 256
References
1. Woo PC, Huang Y, Lau SK, Yuen KY. Coronavirus genomics and bioinformatics analysis. viruses. 2010;2(8):1804–1820.
2. Lu R, Zhao X, Li J, Niu P, Yang B, Wu H, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet. 2020;395(10224):565–574.
3. Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020;25(3).
April 23, 2020 15/16
4. Chu DK, Pan Y, Cheng S, Hui KP, Krishnan P, Liu Y, et al. Molecular diagnosis of a novel coronavirus (2019-nCoV) causing an outbreak of pneumonia. Clinical chemistry. 2020;.
5. Marston DA, McElhinney LM, Ellis RJ, Horton DL, Wise EL, Leech SL, et al. Next generation sequencing of viral RNA genomes. BMC genomics. 2013;14(1):444.
6. Beijing Institute of Genomics, Chinese Academy of Science. China National Center for Bioinformation & National Genomics Data Center; 2013. https://bigd.big.ac.cn/ncov/?lang=en.
7. Organization WH. WHO report Coronavirus disease 2019 (COVID-19). Geneva :: World Health Organization,; 2020.
8. Wang Y, Kang H, Liu X, Tong Z. Combination of RT-qPCR Testing and Clinical Features For Diagnosis of COVID-19 facilitates management of SARS-CoV-2 Outbreak. Journal of Medical Virology. 2020;.
9. Metsky HC, Freije CA, Kosoko-Thoroddsen TSF, Sabeti PC, Myhrvold C. CRISPR-based surveillance for COVID-19 using genomically-comprehensive machine learning design. bioRxiv. 2020;.
10. Wang M, Wu Q, Xu W, Qiao B, Wang J, Zheng H, et al. Clinical diagnosis of 8274 samples with 2019-novel coronavirus in Wuhan. medRxiv. 2020;.
11. Wang S, Kang B, Ma J, Zeng X, Xiao M, Guo J, et al. A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). medRxiv. 2020;.
12. Kim JY, Choe PG, Oh Y, Oh KJ, Kim J, Park SJ, et al. The first case of 2019 novel coronavirus pneumonia imported into Korea from Wuhan, China: implication for infection prevention and control measures. Journal of Korean Medical Science. 2020;35(5).
13. Pearson WR. [5] Rapid and sensitive sequence comparison with FASTP and FASTA. 1990;.
14. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403–410.
15. Pinello L, Lo Bosco G, Yuan GC. Applications of alignment-free methods in epigenomics. Briefings in Bioinformatics. 2014;15(3):419–430.
16. Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003;19(4):513–523.
17. Bzhalava D, Ekstro¨m J, Lysholm F, Hultin E, Faust H, Persson B, et al. Phylogenetically diverse TT virus viremia among pregnant women. Virology. 2012;432(2):427–434.
18. Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, et al. DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering. 2016;9(05):280.
19. Rizzo R, Fiannaca A, La Rosa M, Urso A. A deep learning approach to dna sequence classification. In: International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer; 2015. p. 129–140.
April 23, 2020 16/16
20. Tampuu A, Bzhalava Z, Dillner J, Vicente R. ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PloS one. 2019;14(9).
21. Ren J, Song K, Deng C, Ahlgren NA, Fuhrman JA, Li Y, et al. Identifying viruses from metagenomic data by deep learning. arXiv preprint arXiv:180607810. 2018;.
22. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research. 2001;29(1):308–311.
23. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance. 2017;22(13).
24. Ribeiro CdS, van Roode MY, Haringhuizen GB, Koopmans MP, Claassen E, van de Burgwal LH. How ownership rights over microorganisms affect infectious disease control and innovation: a root-cause analysis of barriers to data sharing as experienced by key stakeholders. PloS one. 2018;13(5).
25. Simon JH, Claassen E, Correa CE, Osterhaus AD. Managing severe acute respiratory syndrome (SARS) intellectual property rights: the possible role of patent pooling. Bulletin of the World Health Organization. 2005;83:707–710.
26. Ribeiro CdS, Koopmans MP, Haringhuizen GB. Threats to timely sharing of pathogen sequence data. Science. 2018;362(6413):404–406.
27. Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Claassen E, Garssen J, Kraneveld AD. Accurate Identification of SARS-CoV-2 from Viral Genome Sequences using Deep Learning. bioRxiv. 2020;doi:10.1101/2020.03.13.990242.
28. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980. 2014;.
29. Mizrachi I. GenBank: the nucleotide sequence database. The NCBI Handbook [Internet], updated. 2007;22.
30. Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schoenhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC bioinformatics. 2019;20(1):480.
31. Shi CS, Nabar NR, Huang NN, Kehrl JH. SARS-Coronavirus Open Reading Frame-8b triggers intracellular stress pathways and activates NLRP3 inflammasomes. Cell death discovery. 2019;5(1):1–12.
32. Kanzawa N, Nishigaki K, Hayashi T, Ishii Y, Furukawa S, Niiro A, et al. Augmentation of chemokine production by severe acute respiratory syndrome coronavirus 3a/X1 and 7a/X4 proteins through NF-κB activation. FEBS letters. 2006;580(30):6807–6812.
33. Padhan K, Tanwar C, Hussain A, Hui PY, Lee MY, Cheung CY, et al. Severe acute respiratory syndrome coronavirus Orf3a protein interacts with caveolin. Journal of General Virology. 2007;88(11):3067–3077.
34. Untergasser A, Nijveen H, Rao X, Bisseling T, Geurts R, Leunissen JA. Primer3Plus, an enhanced web interface to Primer3. Nucleic acids research. 2007;35(suppl 2):W71–W74.
April 23, 2020 17/16
35. Kalendar R, Lee D, Schulman AH, et al. FastPCR software for PCR primer and probe design and repeat search. Genes, Genomes and Genomics. 2009;3(1):1–14.
36. Zhang YZ, Holmes EC. A Genomic Perspective on the Origin and Emergence of SARS-CoV-2. Cell. 2020;.
37. Xia X. Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Molecular Biology and Evolution. 2020;.