
April 23, 2020 1/16

Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning

Alejandro Lopez-Rincon1, Alberto Tonda2, Lucero Mendoza-Maldonado3, Daphne G.J.C. Mulders4, Richard Molenkamp4, Eric Claassen5, Johan Garssen1,6, Aletta D. Kraneveld1

1. Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, Universiteitsweg 99, 3584 CG Utrecht, the Netherlands
2. UMR 518 MIA-Paris, INRAE, c/o 113 rue Nationale, 75103, Paris, France
3. Hospital Civil de Guadalajara “Dr. Juan I. Menchaca”, Salvador Quevedo y Zubieta 750, Independencia Oriente, C.P. 44340 Guadalajara, Jalisco, Mexico
4. Department of Viroscience, Erasmus Medical Center, Rotterdam, the Netherlands
5. Athena Institute, Vrije Universiteit, De Boelelaan 1085, 1081 HV Amsterdam, the Netherlands
6. Department Immunology, Danone Nutricia research, Uppsalalaan 12, 3584 CT Utrecht, the Netherlands

(Submitted: 23 April 2020 – Published online: 27 April 2020)

DISCLAIMER This paper was submitted to the Bulletin of the World Health Organization and was posted to the COVID-19 open site, according to the protocol for public health emergencies for international concern as described in Vasee Moorthy et al. (http://dx.doi.org/10.2471/BLT.20.251561). The information herein is available for unrestricted use, distribution and reproduction in any medium, provided that the original work is properly cited as indicated by the Creative Commons Attribution 3.0 Intergovernmental Organizations licence (CC BY IGO 3.0).

RECOMMENDED CITATION

Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Mulders D.G.J.C., Molenkamp R, Claassen E, et al. Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning. [Preprint]. Bull World Health Organ. E-pub: 27 April 2020. doi: http://dx.doi.org/10.2471/BLT.20.261842


Abstract

Objective

Deep learning techniques can deliver remarkable results in biology, being able to flawlessly perform complex tasks such as differentiating virus strains of the same family. Nevertheless, most of the models created by these algorithms are black boxes, and their decision process is impervious to human interpretation. In this paper, we apply techniques of explainable AI to the task of discovering representative genomic sequences in SARS-CoV-2, to finally generate specific primers.

Methods

Starting from a convolutional neural network trained on 553 sequences from the 2019 Novel Coronavirus Resource database, we distinguish the genome of virus strains from the Coronavirus family with considerable accuracy (> 98%). Next, we analyze the network’s behavior on every sample, to discover sequences used by the model to classify SARS-CoV-2. Then, using feature selection algorithms we find sequences exclusive to SARS-CoV-2. Finally, using the identified sequences we develop a SARS-CoV-2 specific primer set and test it using a conventional PCR.

Findings

A first validation, performed on 583 samples from the NGDC repository and 20,604 from the NCBI repository, shows that we can distinguish SARS-CoV-2 from more than 900 other viruses with a high classification accuracy (> 99%) using only 12 sequences of 21 base pairs. Then, we compute the frequency of appearance of these 12 sequences in 9,294 samples from the GISAID repository, observing frequencies ranging from 95.24% to 99.73%. Finally, we use one of the sequences as a forward primer to generate a primer set. Testing the primer set using a conventional PCR delivers a sensitivity similar to that of routine diagnostic methods, and 100% specificity when comparing to other coronaviruses and when differentiating between SARS-CoV-2 positive patients (n=5) and controls (n=3).

Conclusion

Our methodology, which combines deep learning, viromics and primer design to develop accurate detection of SARS-CoV-2 by means of qPCR, proved to be effective. This approach of detection using cDNA or DNA sequences can be applied to a range of different problems, such as mutations in cancer and autoimmune diseases. Finally, considering the possibility of future pandemics, this technology will be suitable for quickly and accurately creating diagnostic methods to combat the spread.

Introduction

The Coronaviridae family presents a positive-sense, single-stranded RNA genome. These viruses have been identified in avian and mammalian hosts, including humans. Coronaviruses have genomes from 26.4 kilo base-pairs (kbps) to 31.7 kbps, with G + C contents varying from 32% to 43%; human-infecting coronaviruses belonging to this family include SARS-CoV, MERS-CoV, HCoV-OC43, HCoV-229E, HCoV-NL63 and HCoV-HKU1 [1]. In December 2019, SARS-CoV-2, a novel human-infecting coronavirus, was identified in Wuhan, China, using Next Generation Sequencing (NGS) [2].

As with any typical RNA virus, new mutations appear with every replication cycle of a coronavirus, and its average evolutionary rate is roughly 10⁻⁴ nucleotide substitutions per site each year [2]. In the specific case of SARS-CoV-2, RT-qPCR testing using primers in the ORF1ab and N genes has been used to identify the infection in humans [3]. However, this method presents a high false negative rate (FNR), with a detection rate of 30-50% [4]. This low detection rate can be explained by the variation of viral RNA sequences within virus species, and by the viral load in different anatomic sites [5]. The population mutation frequency of site 8,872, located in the ORF1ab gene, and of site 28,144, located in the ORF8 gene, gradually increased from 0 to 29% as the epidemic progressed [6].

As of March 27th, 2020, SARS-CoV-2 has caused 462,684 confirmed cases across almost all countries, with 250,287 cases in the European region [7]. In addition, SARS-CoV-2 has an estimated mortality rate of 3-4%, and it is spreading faster than SARS-CoV and MERS-CoV [8]. SARS-CoV-2 assays can yield false positives if they are not targeted specifically to SARS-CoV-2, as the virus is closely related to other coronaviruses. In addition, SARS-CoV-2 may present together with other respiratory infections, which makes it even more difficult to identify [9, 10].

Thus, it is fundamental to improve existing diagnostic tools to contain the spread. For example, diagnostic tools combining computed tomography (CT) scans with deep learning have been proposed, achieving an improved detection accuracy of 82.9% [11]. Another solution for identifying SARS-CoV-2 is additional sequencing of the viral complementary DNA (cDNA). We can use sequencing data from cDNA resulting from the PCR of the original viral RNA, e.g., Real-Time PCR amplicons (Fig. 1), to identify SARS-CoV-2 [12].

Classification using viral sequencing techniques is mainly based on alignment methods such as FASTA [13] and BLAST [14]. These methods rely on the assumption that DNA sequences share common features, and that their order prevails among different sequences [15, 16]. However, these methods have the drawback of requiring base sequences for the detection [17]. It is therefore necessary to develop innovative, improved diagnostic tools that target the genome to improve the identification of pathogenic variants, as sometimes several tests are needed to obtain an accurate diagnosis. As an alternative, deep learning methods have been suggested for the classification of DNA sequences, as these methods do not need pre-selected features to identify or classify DNA sequences. Deep learning has been used efficiently for the classification of DNA sequences, using one-hot label encoding and Convolutional Neural Networks (CNN) [18, 19], although the examples in the literature only feature DNA sequences of length up to 500 bps.

Fig 1. PCR Amplicons sequencing procedure.

In particular, for the case of viruses, NGS genomic samples might not be identified by BLAST, as there are no reference sequences valid for all genomes, because viruses have a high mutation frequency [20]. Alternative solutions based on deep learning have been proposed to classify viruses by dividing sequences into pieces of fixed length, ranging from 300 bps [20] to 3,000 bps [21]. However, this approach has the negative effect of potentially ignoring part of the information contained in the input sequence, which is disregarded if it cannot completely fill a piece of fixed size.

Given the impact of the worldwide outbreak, international efforts have been made to simplify access to viral genomic data and metadata through international repositories, such as the 2019 Novel Coronavirus Resource (2019nCoVR) repository [6], the National Center for Biotechnology Information (NCBI) repository [22] and the Global Initiative on Sharing All Influenza Data (GISAID) repository [23], in the expectation that easier access to information would make it possible to develop medical countermeasures to control the disease worldwide, as happened in similar cases earlier [24–26]. Thus, taking advantage of the information available from international resources without any political and/or economic borders, we propose an innovative system based on viral gene sequencing.

Starting from a CNN trained to separate Coronavirus samples belonging to different strains [27], including SARS-CoV-2, we apply techniques inspired by explainable AI in computer vision to discover representative cDNA sequences that the network uses to classify SARS-CoV-2. We then validate the discovered sequences on datasets not used during the training of the CNN, and show how to exploit them to create a novel, highly informative feature space. Experimental results show that the new feature space leads traditional, simple classifiers to correctly identify SARS-CoV-2 with remarkable accuracy (> 99%). A few of the discovered sequences also possess the correct characteristics for potentially becoming primers, as just checking for their presence in samples is enough to identify SARS-CoV-2.

Materials and methods

The CNN used during all the experiments is composed of one convolutional layer with 12 filters (each with window size 21) with max pooling (with pool size and stride 148), a fully connected layer (196 rectified linear units with dropout probability 0.5), and a final softmax layer with 5 units, to differentiate the different classes of Coronavirus strains. The optimizer used is Adaptive Moment Estimation (Adam) [28], with learning rate 10⁻⁵ and a batch size of 50 samples, run for 1,000 epochs [27].

The convolutional layer of the network, in simple terms, analyzes subsequences of 21 base pairs that can appear at different points of the virus genome. We selected 21 because primers normally have a length of 18-22 bps. The pool size of the max pooling represents the interval in which a specific 21-bps sequence can be recognized (in this case, 148 positions). Through the training process, the convolutional layer is de facto learning new features to characterize the problem, directly from the data. In this specific case, the new features are 21-bps sequences that can more easily separate different virus strains (Fig. 2). By analyzing the result of each filter in a convolutional layer, and how its output interacts with the corresponding max pooling, it is possible to detect human-readable sequences of base pairs that might provide domain experts with relevant information. It is important to note that these sequences are not bound to specific locations of the genome; thanks to its structure, the CNN is able to detect them and recognize their importance even if their position is displaced in different samples.
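To make the mechanism concrete, the following is a minimal sketch of how a single filter followed by max pooling can point back to a readable 21-bps sequence. It is not the trained network: the one-hot encoding, the sigmoid activation, the hand-built filter weights and the toy genome are all assumptions made for illustration, and only the 21-bps motif and the pool size of 148 are taken from the text.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors (assumed encoding)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def filter_response(encoded, weights, bias=0.0):
    """Slide one filter over the sequence; a sigmoid (assumed) keeps outputs in (0, 1)."""
    width = len(weights)
    out = []
    for start in range(len(encoded) - width + 1):
        z = bias
        for offset in range(width):
            for channel in range(4):
                z += weights[offset][channel] * encoded[start + offset][channel]
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

def max_pool(values, pool):
    """Non-overlapping max pooling; returns (max value, position of the maximum) per pool."""
    pooled = []
    for start in range(0, len(values), pool):
        chunk = values[start:start + pool]
        best = max(range(len(chunk)), key=chunk.__getitem__)
        pooled.append((chunk[best], start + best))
    return pooled

def motif_filter(motif):
    """Hypothetical hand-built filter: +1 for a matching base, -1 for a mismatch."""
    return [[1.0 if b == base else -1.0 for b in BASES] for base in motif]

motif = "AGGTAACAAACCAACCAACTT"                   # 21-bps sequence quoted in the text
genome = "TTGACC" + motif + "TCGATCTCTTGTAGATCTG"  # toy genome, not real data
responses = filter_response(one_hot(genome), motif_filter(motif), bias=1.0 - len(motif))
value, position = max_pool(responses, pool=148)[0]
recovered = genome[position:position + len(motif)]  # sequence behind the pooled maximum
```

The position returned by the pooling step is what makes the filter interpretable: reading 21 bases from that position in the input recovers the sequence that triggered the strongest response within the 148-position interval.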


Fig 2. Overall procedure to find the specific SARS-CoV-2 sequences to create a primer set.

The trained CNN described above obtained a mean accuracy of 98.75% in a 10-fold cross-validation [27]. Once the network is trained, in a first analysis, we plot the inputs and outputs of the convolutional layer, to visually inspect them for patterns. As an example, in Fig. 3 we report the visualization of the first 1,250 bps of each of the 553 samples from the NGDC [6] repository (Table 1).

Each filter slides a 21-bps window over the input, and for each step produces a single value. The output of a filter is thus a sequence of values in (0, 1): Fig. 4 shows the output of the first filter in the CNN, for the first 1,250 bps of all 553 samples, mapped to a monochromatic image where the closer a value is to 1, the whiter the corresponding pixel is.

Table 1. Organism, assigned label, and number of samples in the unique sequences obtained from the repository [6]. We use the NCBI organism naming convention [29].

Organism             Label   Number of Samples
SARS-CoV-2           0       66
MERS-CoV             1       236
HCoV-OC43            2       136
HCoV-229E            2       22
HCoV-EMC             2       6
HCoV-4408            2       2
HCoV-NL63            3       58
HCoV-HKU1            3       17
SARS-CoV             4       7
SARS-CoV P2          4       1
SARS-CoV HKU-39849   4       1
SARS-CoV GDH-BJH01   4       1

Fig 3. cDNA visualization for the first 1,250 bps (horizontal axis: sequence positions 0-1,250) from the input dataset, for each of the 553 samples. Each sample is represented by a horizontal line of pixels. Colored pixels represent bases: G=green, C=blue, A=red, T=orange, missing=black. The data is separated by class (Table 1).

Fig 4. The output of convolutional filter 0 (horizontal axis: sequence positions 0-1,250), for the input given in Fig. 3. The output of the filters is a series of continuous values in (0, 1), here represented in grayscale, with higher values closer to white.

The output of the max pooling for each filter is then further inspected for patterns. An example of the output of the max pooling for the first filter is displayed in Fig. 5: samples belonging to different classes can already be visually distinguished. At this step, we identify filter 0 as the most promising, as it seems to focus on a few relevant points in the genome that could correspond to meaningful cDNA sequences.

Fig 5. Visualization of the output of the max pooling for the first filter of the CNN, with the data visualized in Fig. 4 as input. Different patterns for samples from different classes are recognizable from a simple visual inspection. Here samples from the same class appear grouped, one after the other, for clarity.

Given this data, it is now possible to identify the 21-bps sequences that obtained the highest output values in the max pooling layer of filter 0, in each section of 148 positions. This process results in 210 (31,029 divided by 148, rounded up) max pooling features, each one identifying the 21-bps sequence that obtained the highest value from the convolutional filter in a specific 148-position interval of the original genome: the first max pooling feature covers positions 1-148, the second covers positions 149-296, and so on. We plot the whole set of max pooling features for the complete data, 4,410 positions (210*21), in Fig. 6. Analyzing the different sequence values appearing in the max pooling feature space, we find a total of 3,827 unique 21-bps cDNA sequences that can potentially be very informative for identifying different virus strains. For example, sequence AGG TAA CAA ACC AAC CAA CTT is only found inside the class of SARS-CoV-2, in 59 out of 66 available samples. Sequence CAC GAG TAA CTC GTC TAT CTT is again present only in SARS-CoV-2, in 63 out of the 66 samples.

Fig 6. cDNA visualization for the 210 21-bps-long sequences selected from the input dataset (two panels: sequence positions 0-2,205 and 2,205-4,410). Each sample is represented by a horizontal line of pixels. Colored pixels represent bases: G=green, C=blue, A=red, T=orange, missing=black. We divide the whole information into two panels for visualization purposes; from visual inspection we can see the similarity of the patterns between the classes.
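The pooling arithmetic quoted in the text can be checked with a few lines. This is only a sketch: the constant names and the helper `pooling_interval` are hypothetical, and treating 31,029 as the (padded) input length with the last interval truncated at that length is an assumption.

```python
import math

GENOME_LENGTH = 31_029  # length quoted in the text (assumed to be the padded input length)
POOL_SIZE = 148         # pool size and stride of the max pooling
WINDOW = 21             # width of each convolutional filter

# Number of max pooling features per filter, and width of the plot in Fig. 6.
n_features = math.ceil(GENOME_LENGTH / POOL_SIZE)
plot_width = n_features * WINDOW

def pooling_interval(feature_index):
    """1-based, inclusive genome positions covered by the given 1-based pooling feature."""
    start = (feature_index - 1) * POOL_SIZE + 1
    end = min(feature_index * POOL_SIZE, GENOME_LENGTH)
    return start, end
```

With these values, `n_features` is 210 and `plot_width` is 4,410, matching the figures given above; `pooling_interval(1)` and `pooling_interval(2)` return the intervals 1-148 and 149-296.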

The combination of the convolutional and max pooling layers allows the CNN to identify sequences even if they are slightly displaced in the genome (by up to 148 positions). As some samples might present sequences that are displaced even more, in the next experiments we decided to consider only the relative frequency of the 21-bps sequences identified at the previous step, creating a sequence feature space, to verify whether the appearance of specific sequences at any point in the genome could be enough to differentiate between virus strains.
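A sequence feature space of this kind can be sketched as a table with one row per genome and one column per candidate 21-bps sequence. The function names and the toy genomes below are hypothetical; the sketch stores raw occurrence counts, whereas the exact normalization behind "relative frequency" is not specified in the text.

```python
def count_occurrences(genome, kmer):
    """Count (possibly overlapping) occurrences of kmer anywhere in genome."""
    count = 0
    start = genome.find(kmer)
    while start != -1:
        count += 1
        start = genome.find(kmer, start + 1)
    return count

def sequence_feature_space(genomes, kmers):
    """Build the table: one row per genome, one column per candidate sequence."""
    return [[count_occurrences(genome, kmer) for kmer in kmers] for genome in genomes]

# The two example sequences quoted in the text, written without spaces.
kmers = ["AGGTAACAAACCAACCAACTT", "CACGAGTAACTCGTCTATCTT"]

# Toy genomes (not real data): position in the genome does not matter.
genomes = [
    "ACGT" + kmers[0] + "GGG" + kmers[1],
    "TTTT" + kmers[0] + "A" * 200 + kmers[0],
    "ACGTACGTACGT",
]

table = sequence_feature_space(genomes, kmers)
```

Because the counts ignore position, a sequence displaced by more than 148 positions contributes to the same column, which is the motivation given above for leaving the max pooling representation.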

Identifying SARS-CoV-2

Experiment 1: Validation on the NGDC dataset

We downloaded the dataset from the National Genomics Data Center (NGDC) repository [6] on March 15th 2020. We removed repeated sequences and applied the whole procedure to translate the data into the sequence feature space. This leaves us with a frequency table of 3,827 features for 583 samples (Table 2). Next, we ran a state-of-the-art feature selection algorithm [30] to reduce the sequences needed to identify different virus strains to the bare minimum. Remarkably, we are then able to correctly classify all samples using only 53 of the original 3,827 sequences, obtaining 100% accuracy in a 10-fold cross-validation with a simpler and more traditional classifier, Logistic Regression.


Table 2. Organism, assigned label, and number of samples in the unique sequences obtained from the repository [6]. We use the NCBI organism naming convention [29].

Organism             Label   Number of Samples
SARS-CoV-2           0       96
MERS-CoV             1       236
HCoV-OC43            2       136
HCoV-229E            2       22
HCoV-EMC             2       6
HCoV-4408            2       2
HCoV-NL63            3       58
HCoV-HKU1            3       17
SARS-CoV             4       7
SARS-CoV P2          4       1
SARS-CoV HKU-39849   4       1
SARS-CoV GDH-BJH01   4       1

Experiment 2: Validation on the NCBI dataset

We downloaded data from NCBI [22] on March 15th 2020, with the following query: gene=“ORF1ab” AND host=“homo sapiens” AND “complete genome”. The query resulted in 407 non-repeated sequences (Table 3), with 68 sequences belonging to SARS-CoV-2. Then, we applied the whole procedure to translate the data into the sequence feature space, and we ran the same state-of-the-art feature selection algorithm [30]. The result is a list of 10 different sequences (Table 4), for which just checking for their presence is enough to differentiate between SARS-CoV-2 and other viruses in the dataset, with 100% accuracy. Each of the sequences, in fact, only appears in SARS-CoV-2 samples.
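The presence check described above can be illustrated with a short sketch. It is not the feature selection algorithm of [30]: `exclusive_sequences` is a hypothetical, much simpler filter that keeps candidates found only in the target class, and the labelled genomes are toy strings. The two marker sequences are taken from Table 4; the labels follow Table 3 (0 for SARS-CoV-2, 1 for other viruses).

```python
def exclusive_sequences(samples, candidates, target_label):
    """Keep candidates found in at least one target sample and in no other sample."""
    kept = []
    for seq in candidates:
        in_target = any(seq in genome for genome, label in samples if label == target_label)
        in_others = any(seq in genome for genome, label in samples if label != target_label)
        if in_target and not in_others:
            kept.append(seq)
    return kept

def predict_by_presence(genome, markers):
    """Flag a genome as the target virus if it contains any marker sequence."""
    return any(seq in genome for seq in markers)

m1 = "TAGCACTCTCCAAGGGTGTTC"      # from Table 4
m2 = "CATCTACTGATTGGACTAGCT"      # from Table 4
shared = "ACGTACGTACGTACGTACGTA"  # toy sequence present in both classes

# Toy labelled genomes (not real data).
samples = [
    ("GG" + m1 + "AC" + shared, 0),
    ("TT" + m2 + shared + "CA", 0),
    (shared + "ACGTACGT", 1),
    ("TTTTGGGG" + shared, 1),
]

markers = exclusive_sequences(samples, [m1, m2, shared], target_label=0)
predictions = [predict_by_presence(genome, markers) for genome, _ in samples]
accuracy = sum(pred == (label == 0) for pred, (_, label) in zip(predictions, samples)) / len(samples)
```

On the toy data the shared sequence is discarded and the two remaining markers separate the classes perfectly, which mirrors, on a much smaller scale, the behaviour reported for the 10 sequences of Table 4.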

Table 3. Organism, assigned label, and number of samples in the unique sequences obtained from the repository NCBI [29].

Virus        Label   Number of Samples
SARS-CoV-2   0       68
MERS-CoV     1       180
HCoV-OC43    1       105
HCoV-NL63    1       29
HCoV-HKU1    1       13
HCoV-4408    1       2
HCoV-229E    1       3
HCoV-EMC     1       3
HAstV-VA1    1       1
HAstV-BF34   1       1
HMO-A        1       1
HAstV-SG     1       1

Experiment 3: Further validation on the NCBI dataset

We downloaded data from NCBI [22] on March 17th 2020, with the following query: “virus” AND host=“homo sapiens” AND “complete genome”, restricting the size from 1,000 to 35,000 bps. The query returns 20,603 samples, of which only 32 belong to SARS-CoV-2, and 20,571 are from other taxa, including Hepatitis B, Dengue, Human immunodeficiency, Human orthopneumovirus, Enterovirus A, Hepacivirus C, Chikungunya virus, Zaire ebolavirus, Human respirovirus 3, Orthohepevirus A, Norovirus GII, Hepatitis delta virus, Mumps rubulavirus, Enterovirus D, Zika virus, Measles morbillivirus, Enterovirus C, Human T-cell leukemia virus type I, Yellow fever virus, Adeno-associated virus, rhinovirus (A, B and C), for a total of more than 900 viruses. Then, we applied the whole procedure to translate the data into the sequence feature space and ran the feature reduction algorithm [30]. This results in 2 sequences of 21 bps: just by checking for their presence, we are able to separate SARS-CoV-2 from the rest of the samples with 100% accuracy. The sequences are: AAT AGA AGA ATT ATT CTA TTC and CGA TAA CAA CTT CTG TGG CCC.

Table 4. Sequences that only exist in SARS-CoV-2, that help differentiate between the virus and other taxa as displayed in Table 3.

TAG CAC TCT CCA AGG GTG TTC
CAT CTA CTG ATT GGA CTA GCT
AAT GAA TTA TCA AGT TAA TGG
CAC GTA GGA ATG TGG CAA CTT
TGA GCA GTG CTG ACT CAA CTC
CAA CTT TTA ACG TAC CAA TGG
CTA AAG CAT ACA ATG TAA CAC
GAT GGT CAA GTA GAC TTA TTT
TGC CAC TTG GCT ATG TAA CAC
TAT TAG TGA TAT GTA CGA CCC

Experiment 4: Validation on the GISAID dataset

From the GISAID repository [23], we downloaded the 9,294 SARS-CoV-2 sequences from different countries available on April 16th; of these, 8,256 are non-repeated, complete and have host=“homo sapiens”. Then, we calculated the frequency table of the 21-bps sequences obtained from experiments 2 and 3, to verify which sequences remain and could be used for detection. The appearance frequency of the target sequences among the samples in the GISAID dataset is reported in Table 5. Of the 8,256 sequences, 87.13% contain all 12 target sequences, and every one of them contains at least 5.

Table 5. Frequency of appearance (in percentage) for the sequences discovered in experiments 2 and 3, among the 8,256 samples from GISAID [23].

Genomic Sequence              Frequency
CAC GTA GGA ATG TGG CAA CTT   99.73%
TAT TAG TGA TAT GTA CGA CCC   99.60%
AAT GAA TTA TCA AGT TAA TGG   99.55%
AAT AGA AGA ATT ATT CTA TTC   99.54%
CAA CTT TTA ACG TAC CAA TGG   99.38%
CTA AAG CAT ACA ATG TAA CAC   99.38%
TAG CAC TCT CCA AGG GTG TTC   99.29%
CGA TAA CAA CTT CTG TGG CCC   98.84%
TGC CAC TTG GCT ATG TAA CAC   97.57%
CAT CTA CTG ATT GGA CTA GCT   97.40%
TGA GCA GTG CTG ACT CAA CTC   96.06%
GAT GGT CAA GTA GAC TTA TTT   95.24%
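The two statistics reported for this experiment, the per-sequence frequency of appearance and the number of target sequences found in each genome, can be computed as in the following sketch. The function names and the four toy genomes are hypothetical; the three sequences are taken from Table 5, and the resulting numbers describe the toy data only.

```python
def appearance_frequency(genomes, sequences):
    """Percentage of genomes that contain each sequence at least once."""
    total = len(genomes)
    return {seq: 100.0 * sum(seq in genome for genome in genomes) / total for seq in sequences}

def hits_per_genome(genomes, sequences):
    """Number of distinct target sequences found in each genome."""
    return [sum(seq in genome for seq in sequences) for genome in genomes]

a = "CACGTAGGAATGTGGCAACTT"  # from Table 5
b = "TATTAGTGATATGTACGACCC"  # from Table 5
c = "AATGAATTATCAAGTTAATGG"  # from Table 5
targets = [a, b, c]

# Toy genomes (not real data); "NN" stands for unrelated bases.
genomes = [
    a + "NN" + b + "NN" + c,
    a + "NN" + b,
    a + "NN" + c + "NN" + b,
    a + "NN" + c,
]

frequency = appearance_frequency(genomes, targets)
hits = hits_per_genome(genomes, targets)
share_with_all = 100.0 * sum(h == len(targets) for h in hits) / len(genomes)
fewest_hits = min(hits)
```

Applied to the 8,256 GISAID genomes and the 12 target sequences, `frequency` would correspond to the column of Table 5, `share_with_all` to the 87.13% figure, and `fewest_hits` to the minimum of 5 reported in the text.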

Experiment 5: Laboratory validation of the candidate primer set

Viral RNA was isolated from cell-cultured SARS-CoV-2, SARS-1, MERS-CoV, hCoV-NL63, hCoV-OC43, hCoV-229E, and from nasopharyngeal swabs from patients by MagNA Pure LC (Roche Diagnostics, The Netherlands) using the total nucleic acid isolation kit. The RNA was converted into cDNA using SuperscriptIII (Thermo-Fisher Scientific, USA) and random hexamers. Subsequently, conventional PCR was performed on the cDNA using HotStar Taq DNA polymerase (Qiagen, The Netherlands) with 400 nM forward primer (5’-AG CAC TCT CCA AGG GTG TTC-3’) and 400 nM reverse primer (5’-GCA AAG CCA AAG CCT CAT TA-3’) and the following cycling conditions: 15 min at 95 °C, followed by 40 cycles of 1 min at 95 °C, 1 min at 52 °C and 1 min at 72 °C. The PCR products were visualized by electrophoresis. The same RNA was used in a diagnostic reference assay (Corman et al. [3]) and the cycle threshold values from this reference assay were used for estimating sensitivity.

Results and Discussion

Identifying SARS-CoV-2

Summarizing the results of experiments 1-4, we discovered 12 meaningful 21-bps sequences that best characterize SARS-CoV-2. For all the analyzed data, these sequences appear only in SARS-CoV-2 samples and not in any other viruses, as summarized in Table 6.

Table 6. Percentage of appearance for each of the 12 discovered 21-bps sequences across the different datasets, and comparison to similar viruses in nature and other hosts.

Columns (source; virus; host; number of samples):
(1) GISAID; SARS-CoV-2; Homo Sapiens; 8,256
(2) NCBI; Other Taxa; Homo Sapiens; 20,572
(3) NCBI; SARS-CoV-2; Homo Sapiens; 32
(4) NGDC; Other Taxa; Homo Sapiens; 487
(5) NGDC; SARS-CoV-2; Homo Sapiens; 96
(6) GISAID; Betacoronavirus; Manis javanica; 9
(7) GISAID; SARS-related coronavirus; Rhinolophus affinis; 1
(8) GISAID; betacoronavirus; Canine; 1

Sequence                      (1)      (2)     (3)       (4)     (5)       (6)       (7)       (8)
CAC GTA GGA ATG TGG CAA CTT   99.73%   0.00%   100.00%   0.00%   97.92%    0.00%     100.00%   100.00%
TAT TAG TGA TAT GTA CGA CCC   99.60%   0.00%   100.00%   0.00%   97.92%    0.00%     0.00%     100.00%
AAT GAA TTA TCA AGT TAA TGG   99.55%   0.00%   100.00%   0.00%   96.88%    100.00%   0.00%     100.00%
AAT AGA AGA ATT ATT CTA TTC   99.54%   0.00%   100.00%   0.00%   96.88%    0.00%     100.00%   100.00%
CAA CTT TTA ACG TAC CAA TGG   99.38%   0.00%   100.00%   0.00%   97.92%    0.00%     0.00%     100.00%
CTA AAG CAT ACA ATG TAA CAC   99.38%   0.00%   100.00%   0.00%   100.00%   0.00%     0.00%     100.00%
TAG CAC TCT CCA AGG GTG TTC   99.29%   0.00%   100.00%   0.00%   97.92%    0.00%     0.00%     100.00%
CGA TAA CAA CTT CTG TGG CCC   98.84%   0.00%   100.00%   0.00%   97.92%    0.00%     100.00%   0.00%
TGC CAC TTG GCT ATG TAA CAC   97.57%   0.00%   100.00%   0.00%   97.92%    0.00%     100.00%   100.00%
CAT CTA CTG ATT GGA CTA GCT   97.40%   0.00%   100.00%   0.00%   97.92%    0.00%     100.00%   100.00%
TGA GCA GTG CTG ACT CAA CTC   96.06%   0.00%   100.00%   0.00%   98.96%    0.00%     0.00%     100.00%
GAT GGT CAA GTA GAC TTA TTT   95.24%   0.00%   100.00%   0.00%   96.88%    0.00%     0.00%     100.00%

After the analysis carried out on the deep learning model, we remark that the discovered sequence TAG CAC TCT CCA AGG GTG TTC in particular, exclusive to SARS-CoV-2, shows a frequency of appearance of 99.29% in viral genomes available from different countries in GISAID [23] and 100.0% in the NCBI [22] dataset. Using sample NCBI NC_045512.2 as the reference SARS-CoV-2 sequence, we identify that this discovered sequence is located between nucleotides 25,604 and 25,624 in the ORF3a gene. In SARS-CoV, this gene encodes a protein of 274 aa that is related to necrotic cell death [31], to the production of chemokines such as interleukin 8 (IL-8) and RANTES/CCL5, and to NFκB activation resulting in an inflammatory response [32], and may play an important role in the virus life cycle [33]. We design a specific primer set for the detection of SARS-CoV-2 using Primer3plus [34]. We use TAG CAC TCT CCA AGG GTG TTC as forward primer and GCA AAG CCA AAG CCT CAT TA as reverse primer, obtaining an amplicon size of 178 bps. Then, we run an in-silico PCR test using FastPCR 6.7 [35] with default parameters; this yields the results reported in Fig. 7.
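As a rough illustration of what an in-silico check of a primer pair involves, the sketch below computes the GC content of both primers and the length of the product they would delimit on a template. It does not reproduce Primer3plus or FastPCR: the helper functions are hypothetical, only exact matches are considered, and the template is a toy string whose 137-base spacer was chosen so that the product length equals the 178 bps reported above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def amplicon_length(template, forward, reverse):
    """Length of the exact-match product between the two primers, or None if either is absent."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its binding site
    # appears on the template as the reverse complement of the primer.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return end + len(site) - start

forward = "TAGCACTCTCCAAGGGTGTTC"  # forward primer from the text
reverse = "GCAAAGCCAAAGCCTCATTA"   # reverse primer from the text

# Toy template (not the real genome): flank + forward + spacer + binding site + flank.
template = "ACGT" * 3 + forward + "A" * 137 + reverse_complement(reverse) + "TTT"

product = amplicon_length(template, forward, reverse)
forward_gc = gc_content(forward)
reverse_gc = gc_content(reverse)
```

Running the same functions on the reference genome instead of the toy template would be one way to cross-check the reported amplicon size, under the assumption that both primers match the reference exactly.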

Laboratory validation of the candidate primer set

To validate by laboratory methods the data obtained in silico, a conventional PCR was performed on RNA from SARS-CoV-2 and other human coronaviruses. In addition, RNAs from nasopharyngeal swabs from six patients previously diagnosed with SARS-CoV-2 infection and four patients negative for SARS-CoV-2 by the routine diagnostic method [3] were analyzed with the same conventional PCR (Fig. 8). Different dilutions of SARS-CoV-2 RNA were detected with sensitivity similar to that of the diagnostic reference assay (Fig. 8, lanes 1-8). Our candidate primer set exclusively detected SARS-CoV-2 and did not amplify RNA from other human coronaviruses (Fig. 8, lanes 9-14). The candidate primer set was able to detect SARS-CoV-2 RNA in patient samples previously found positive for SARS-CoV-2, but not in samples from patients previously found negative (Fig. 8, lanes 15-24). Although further validation will be required to develop this candidate primer set into a diagnostic assay, our results demonstrate the power of our method to select potential sequences for further validation.

Fig 7. In-silico PCR test results from the FastPCR 6.7 software [35] using sequences TAG CAC TCT CCA AGG GTG TTC and GCA AAG CCA AAG CCT CAT TA as primers, in NC_045512.2 used as a reference SARS-CoV-2 sequence.

Fig 8. Laboratory validation of the candidate primer set by conventional PCR. MM, molecular weight marker; Lanes 1-8, 10-fold dilutions of SARS-CoV-2 RNA (corresponding to Ct values 26 to 39 in the diagnostic reference assay); Lanes 9-14, RNA from different human coronaviruses (hCoV-OC43, hCoV-229E, hCoV-NL63, MERS-CoV, SARS-1, SARS-CoV-2 respectively); Lanes 15, 16, 17, 19, 20, 21, patient samples previously found positive for SARS-CoV-2; Lanes 18, 22, 23, 24, patient samples previously found negative for SARS-CoV-2


Conclusions

Being able to reliably identify SARS-CoV-2 and distinguish it from other similar pathogens is important to contain its spread. The time needed to process samples and the availability of reliable diagnostic tests are challenges during an outbreak. Developing innovative diagnostic tools that target the genome to improve the identification of pathogens can help reduce health costs and the time needed to identify the infection, avoiding unsuitable treatments or testing. Moreover, it is necessary to perform an accurate classification to identify the different species of Coronavirus, the genetic variants that could appear in the future, and the co-infections with other pathogens.

Given the high transmissibility of SARS-CoV-2, the proper diagnosis of the disease is urgent, to stop the virus from spreading further. Considering the false negatives given by standard nucleic acid detection, better implementations, such as those using deep learning, are necessary in order to properly detect the virus. While the accuracy of current nucleic acid testing is around 30-50%, and CT scans with deep learning reach 83%, we believe that the use of the sequences detected by a CNN-based system has the potential to improve the accuracy of the diagnosis above 95%.

Our results show that by targeting only 12 specific 21-bps sequences, we are able to distinguish SARS-CoV-2 from any other virus with high accuracy (> 99%). Further testing is necessary to confirm these promising results, so it is essential to create multidisciplinary groups that work to stop the outbreak. Finally, as an interesting remark, by comparing the discovered sequences against other hosts, we noticed that of the 12 sequences exclusive to SARS-CoV-2, one appears in all of the 9 samples from Manis javanica. In contrast, 5 of the SARS-CoV-2 sequences appear in the only sample available from Rhinolophus affinis, and 11 out of 12 in a canine sample. This is consistent with the findings of Zhang et al. [36, 37], and could point to the zoonotic origin of the virus. Nevertheless, more data is necessary.

As a result of high-density populations and ever-growing interaction between people, it is possible that other pandemics may occur. Thus, thinking forward, our methodology can be applied in future viral pandemics to speed up the development of accurate detection methods for diagnosis, and thereby contribute to limiting the spread of a virus.

Ethical approval

The study was approved by the Medical Ethical Commission of the Erasmus MC (MEC-2015-306).

References

1. Woo PC, Huang Y, Lau SK, Yuen KY. Coronavirus genomics and bioinformatics analysis. Viruses. 2010;2(8):1804–1820.

2. Lu R, Zhao X, Li J, Niu P, Yang B, Wu H, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet. 2020;395(10224):565–574.

3. Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020;25(3).


4. Chu DK, Pan Y, Cheng S, Hui KP, Krishnan P, Liu Y, et al. Molecular diagnosis of a novel coronavirus (2019-nCoV) causing an outbreak of pneumonia. Clinical chemistry. 2020;.

5. Marston DA, McElhinney LM, Ellis RJ, Horton DL, Wise EL, Leech SL, et al. Next generation sequencing of viral RNA genomes. BMC genomics. 2013;14(1):444.

6. Beijing Institute of Genomics, Chinese Academy of Science. China National Center for Bioinformation & National Genomics Data Center; 2013. https://bigd.big.ac.cn/ncov/?lang=en.

7. World Health Organization. WHO report: Coronavirus disease 2019 (COVID-19). Geneva: World Health Organization; 2020.

8. Wang Y, Kang H, Liu X, Tong Z. Combination of RT-qPCR Testing and Clinical Features For Diagnosis of COVID-19 facilitates management of SARS-CoV-2 Outbreak. Journal of Medical Virology. 2020.

9. Metsky HC, Freije CA, Kosoko-Thoroddsen TSF, Sabeti PC, Myhrvold C. CRISPR-based surveillance for COVID-19 using genomically-comprehensive machine learning design. bioRxiv. 2020.

10. Wang M, Wu Q, Xu W, Qiao B, Wang J, Zheng H, et al. Clinical diagnosis of 8274 samples with 2019-novel coronavirus in Wuhan. medRxiv. 2020.

11. Wang S, Kang B, Ma J, Zeng X, Xiao M, Guo J, et al. A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). medRxiv. 2020.

12. Kim JY, Choe PG, Oh Y, Oh KJ, Kim J, Park SJ, et al. The first case of 2019 novel coronavirus pneumonia imported into Korea from Wuhan, China: implication for infection prevention and control measures. Journal of Korean Medical Science. 2020;35(5).

13. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. 1990.

14. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403–410.

15. Pinello L, Lo Bosco G, Yuan GC. Applications of alignment-free methods in epigenomics. Briefings in Bioinformatics. 2014;15(3):419–430.

16. Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003;19(4):513–523.

17. Bzhalava D, Ekström J, Lysholm F, Hultin E, Faust H, Persson B, et al. Phylogenetically diverse TT virus viremia among pregnant women. Virology. 2012;432(2):427–434.

18. Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, et al. DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering. 2016;9(05):280.

19. Rizzo R, Fiannaca A, La Rosa M, Urso A. A deep learning approach to DNA sequence classification. In: International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer; 2015. p. 129–140.

20. Tampuu A, Bzhalava Z, Dillner J, Vicente R. ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLoS ONE. 2019;14(9).

21. Ren J, Song K, Deng C, Ahlgren NA, Fuhrman JA, Li Y, et al. Identifying viruses from metagenomic data by deep learning. arXiv preprint arXiv:1806.07810. 2018.

22. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308–311.

23. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance. 2017;22(13).

24. Ribeiro CdS, van Roode MY, Haringhuizen GB, Koopmans MP, Claassen E, van de Burgwal LH. How ownership rights over microorganisms affect infectious disease control and innovation: a root-cause analysis of barriers to data sharing as experienced by key stakeholders. PLoS ONE. 2018;13(5).

25. Simon JH, Claassen E, Correa CE, Osterhaus AD. Managing severe acute respiratory syndrome (SARS) intellectual property rights: the possible role of patent pooling. Bulletin of the World Health Organization. 2005;83:707–710.

26. Ribeiro CdS, Koopmans MP, Haringhuizen GB. Threats to timely sharing of pathogen sequence data. Science. 2018;362(6413):404–406.

27. Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Claassen E, Garssen J, Kraneveld AD. Accurate Identification of SARS-CoV-2 from Viral Genome Sequences using Deep Learning. bioRxiv. 2020;doi:10.1101/2020.03.13.990242.

28. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

29. Mizrachi I. GenBank: the nucleotide sequence database. The NCBI Handbook [Internet], updated. 2007;22.

30. Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schoenhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinformatics. 2019;20(1):480.

31. Shi CS, Nabar NR, Huang NN, Kehrl JH. SARS-Coronavirus Open Reading Frame-8b triggers intracellular stress pathways and activates NLRP3 inflammasomes. Cell Death Discovery. 2019;5(1):1–12.

32. Kanzawa N, Nishigaki K, Hayashi T, Ishii Y, Furukawa S, Niiro A, et al. Augmentation of chemokine production by severe acute respiratory syndrome coronavirus 3a/X1 and 7a/X4 proteins through NF-κB activation. FEBS Letters. 2006;580(30):6807–6812.

33. Padhan K, Tanwar C, Hussain A, Hui PY, Lee MY, Cheung CY, et al. Severe acute respiratory syndrome coronavirus Orf3a protein interacts with caveolin. Journal of General Virology. 2007;88(11):3067–3077.

34. Untergasser A, Nijveen H, Rao X, Bisseling T, Geurts R, Leunissen JA. Primer3Plus, an enhanced web interface to Primer3. Nucleic Acids Research. 2007;35(suppl 2):W71–W74.

35. Kalendar R, Lee D, Schulman AH, et al. FastPCR software for PCR primer and probe design and repeat search. Genes, Genomes and Genomics. 2009;3(1):1–14.

36. Zhang YZ, Holmes EC. A Genomic Perspective on the Origin and Emergence of SARS-CoV-2. Cell. 2020.

37. Xia X. Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Molecular Biology and Evolution. 2020.
