A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances
Laurent Noe, Donald E. K. Martin
LIFL (UMR 8022 Lille 1/CNRS) - Inria Lille, Villeneuve d’Ascq, France
Department of Statistics, North Carolina State University, Raleigh, NC, USA
SeqBio 2014
November 4&5, 2014 - Montpellier
Laurent Noe, Donald E. K. Martin A coverage criterion for spaced seeds and its applications
Outline
1 Introduction to spaced seeds . . .
2 Spaced seed coverage
  Definition
  Associated automaton
  Possible use (as a seed “quality” measure)
3 Experimental results
  SVM classifiers
  Alignment-free distances
4 Conclusion
Spaced Seeds (PatternHunter 02, Burkhardt et al 01, . . . )
Definition
A spaced seed π is defined as a binary word over the alphabet {1, *}:
1 : accepts only the match symbol |,
* : accepts any alignment symbol (joker).
s : span (length), w : weight (number of 1s).
Example
π = 111*1*11
111*1*11
ATCAGTGCGAATGCGCAAGA
|||||:||:|||||.|||||
ATCAGCGCAAATGCTCAAGA
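As a minimal sketch of the definition above (Python; the helper names `span`, `weight` and `seed_hits` are illustrative assumptions, not taken from the slides): a seed hits the alignment at a position when every 1 of the seed faces a match symbol, while * positions are ignored.

```python
def span(seed):
    """Span s: total length of the seed."""
    return len(seed)


def weight(seed):
    """Weight w: number of '1' (must-match) positions."""
    return seed.count("1")


def seed_hits(seed, alignment, match="|"):
    """Return the 0-based start positions where the seed hits the alignment.

    A '1' in the seed requires the match symbol; a '*' accepts any symbol.
    """
    hits = []
    for start in range(len(alignment) - len(seed) + 1):
        window = alignment[start:start + len(seed)]
        if all(a == match for s, a in zip(seed, window) if s == "1"):
            hits.append(start)
    return hits


seed = "111*1*11"
alignment = "|||||:||:|||||.|||||"  # middle line of the example alignment
print(span(seed), weight(seed))     # 8 6
print(seed_hits(seed, alignment))   # [0, 9, 11]
```

On the example alignment this seed hits at three positions, even though the alignment contains no run of six consecutive matches.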
Example
[Figure: the contiguous seed 111111 and the spaced seed 111*1*11, each slid over every position of the alignment, for alignments with zero to three mismatches.]

ATCAGTGCAAATGCTCAAGA
||||||||||||||||||||
ATCAGTGCAAATGCTCAAGA

ATCAGTGCAAATGCTCAAGA
|||||.||||||||||||||
ATCAGCGCAAATGCTCAAGA

ATCAGTGCAAATGCGCAAGA
|||||.||||||||.|||||
ATCAGCGCAAATGCTCAAGA

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
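The comparison in this example can be replayed with a short sketch (Python; `count_hits` is a hypothetical helper, and the four strings are the middle lines of the alignments shown on the slide). It counts, for each alignment, how many positions each seed hits.

```python
def count_hits(seed, alignment, match="|"):
    """Count start positions where every '1' of the seed faces a match."""
    n_positions = len(alignment) - len(seed) + 1
    return sum(
        all(alignment[start + i] == match
            for i, s in enumerate(seed) if s == "1")
        for start in range(n_positions)
    )


alignments = [
    "||||||||||||||||||||",  # no mismatch
    "|||||.||||||||||||||",  # one mismatch
    "|||||.||||||||.|||||",  # two mismatches
    "|||||.||.|||||.|||||",  # three mismatches
]

for a in alignments:
    # alignment, hits of the contiguous seed, hits of the spaced seed
    print(a, count_hits("111111", a), count_hits("111*1*11", a))
```

With these inputs the contiguous seed gets 15, 9, 3 and 0 hits, and the spaced seed 13, 9, 5 and 3: both have weight 6, but only the spaced seed still hits the alignment with three mismatches.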
Recent work related to spaced seeds
1 Alignment-free distances [Leimeister et al., 2014, Horwege et al., 2014, Boden et al., 2013]
2 SVM classification [Onodera and Shibuya, 2013, Ghandi et al., 2014]
3 Read clustering [Bao et al., 2011, Chong et al., 2012, Hauser et al., 2013]
4 Metagenomic classification, . . .
“New Uses for Old Things”
little boy ⇒ frying pan ¹

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11

⇒

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11

¹ http://arch5541.wordpress.com/2012/11/16/and-then-there-was-teflon/
Coverage measure for a seed
Definition
Number of match symbols covered by at least one 1 symbol from any seed hit [Benson and Mak, 2008, Martin, 2013]
Example
ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
••• • •• ••••• ••••    (• : covered match)
111*1*11
         111*1*11
           111*1*11
The coverage is 15.
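The definition above can be checked with a minimal Python sketch. It assumes the alignment is given as its match string ('|' for a match, any other symbol for a mismatch). The function names seed_hits and coverage are illustrative and do not come from the talk.

```python
def seed_hits(seed: str, matches: str) -> list[int]:
    """Start positions where every '1' of the seed faces a match symbol '|'."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    return [
        p
        for p in range(len(matches) - len(seed) + 1)
        if all(matches[p + i] == "|" for i in ones)
    ]


def coverage(seed: str, matches: str) -> int:
    """Number of alignment positions covered by a '1' of at least one seed hit."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for p in seed_hits(seed, matches):
        covered.update(p + i for i in ones)
    return len(covered)


matches = "|||||.||.|||||.|||||"
print(seed_hits("111*1*11", matches))  # [0, 9, 11]
print(coverage("111*1*11", matches))   # 15
```

Positions under a joker (*) are not counted, even when they are matches, which is why the three hits cover 15 positions rather than 17.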
Coverage measure for a seed / a set of seeds
alignment : x = 101111001011111
Example
seed : π = 11*1
π occ1        1 1 * 1
π occ2                        1 1 * 1
π occ3                          1 1 * 1
x       = 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
covered       • •   •         • • • • •
set of seeds : {π1, π2} = {11*1, 1*1*1}
π2 occ1   1 * 1 * 1
π1 occ2       1 1 * 1
π2 occ3                   1 * 1 * 1
π1 occ4                       1 1 * 1
π2 occ5                       1 * 1 * 1
π1 occ6                         1 1 * 1
x       = 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
covered   •   • • • •     •   • • • • •
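The same count extends to a set of seeds by taking the union of the positions covered by every hit of every seed. The sketch below assumes the alignment is the binary word x of the slide (1 for a match, 0 for a mismatch). covered_positions is a hypothetical helper name.

```python
def covered_positions(seeds: list[str], x: str) -> set[int]:
    """Positions of x covered by a '1' of at least one hit of at least one seed."""
    covered = set()
    for seed in seeds:
        ones = [i for i, c in enumerate(seed) if c == "1"]
        for p in range(len(x) - len(seed) + 1):
            # A hit requires a match (symbol '1' in x) under every '1' of the seed.
            if all(x[p + i] == "1" for i in ones):
                covered.update(p + i for i in ones)
    return covered


x = "101111001011111"
single = covered_positions(["11*1"], x)
both = covered_positions(["11*1", "1*1*1"], x)
print(sorted(single), len(single))  # [2, 3, 5, 10, 11, 12, 13, 14] 8
print(sorted(both), len(both))      # [0, 2, 3, 4, 5, 8, 10, 11, 12, 13, 14] 11
```

On this alignment the single seed 11*1 covers 8 positions and the pair {11*1, 1*1*1} covers 11, in agreement with the marked positions of the example.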
Coverage measure for a seed / a set of seeds
That’s how coverage can be measured, estimated, computed on several models...
But... is coverage useful?
Experimental results
1 SVM classifiers
2 Alignment-free distances
SVM classifiers
Are spaced seeds better with string-kernel classifiers?
Yes: see [Onodera and Shibuya, 2013, Ghandi et al., 2014]
Which spaced seed patterns are better? Does coverage help here?
SVM classifiers
1 RFAM 11.0 database (50% training, 50% testing)
2 Single/double seeds of weight w = 3 ... 4, span up to w + 4
3 For each seed,
    Learn (classical string kernel with linear classifier).
    Measure the SVM zero/one error.
4 Compute the correlation coefficient between the SVM zero/one error and:
    the single hit criterion (at least one seed hit)
    the multi hit criterion (at least n seed hits)
    the coverage criterion (coverage of at least n)
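Step 2 of this protocol can be sketched as a small enumeration. The sketch assumes that a seed always starts and ends with a 1 and that mirrored patterns such as 1*11 and 11*1 count as distinct seeds. Under these assumptions it yields fifteen seeds of weight 3, the same number of weight 3 patterns that label the plots of this section. Treating a double seed as an unordered pair of distinct seeds of the same weight is only one plausible reading of the slide. enumerate_seeds is a hypothetical name.

```python
from itertools import combinations


def enumerate_seeds(weight: int, max_extra: int = 4) -> list[str]:
    """All seeds of the given weight (>= 2) with span in [weight, weight + max_extra].

    Assumption: the first and last symbols of a seed are always '1'.
    """
    seeds = []
    for span in range(weight, weight + max_extra + 1):
        # Place the remaining weight - 2 ones among the span - 2 inner positions.
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            chars[0] = chars[-1] = "1"
            for i in inner:
                chars[i] = "1"
            seeds.append("".join(chars))
    return seeds


single_seeds = enumerate_seeds(3)
# Assumption: a double seed is an unordered pair of distinct seeds of equal weight.
double_seeds = list(combinations(single_seeds, 2))
print(len(single_seeds), len(double_seeds))  # 15 105
```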
3 SVM zero/one error
[Plot: SVM zero/one error (axis ticks 52 to 66, annotated from Good to Bad) for each weight 3 single seed: 111, 1*11, 11*1, 1**11, 1*1*1, 11**1, 1***11, 1**1*1, 1*1**1, 11***1, 1****11, 1***1*1, 1**1**1, 1*1***1, 11****1]
4 SVM zero/one error vs multi hit sensitivity for n = 5
[Plot: multi hit sensitivity (y-axis, ticks 0.89 to 0.94) against SVM zero/one error (x-axis, ticks 52 to 66), one point per weight 3 single seed: 111, 1*11, 11*1, 1**11, 1*1*1, 11**1, 1***11, 1**1*1, 1*1**1, 11***1, 1****11, 1***1*1, 1**1**1, 1*1***1, 11****1]
Alignment-free distances
Are spaced seeds better in estimating the “true” alignment distance?
Yes: see [Leimeister et al., 2014, Horwege et al., 2014, Boden et al., 2013] ...
Which spaced seed patterns are better? Does coverage help here?
Alignment-free distances
1 Alignment length (e.g. l = 16, 32, 64)
2 Single/double seeds of weight w = 2 ... 9, span up to w + 4
3 For each seed,
    Generate every possible alignment of length l and measure its percentage of identity.
    Compute the correlation coefficient between the true percentage of identity of each alignment and
        the multi-hit value of the seed (next plot: x-axis)
        the coverage value of the seed (next plot: y-axis)
    on this alignment.
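Step 3 can be illustrated by brute force on a short alignment. The sketch below makes several assumptions that the slide does not state: an alignment is a binary tuple (1 for a match), all 2^l alignments are weighted equally, the multi-hit value is the number of seed hits, the coverage value is the number of covered positions, and the correlation is Pearson's. Lengths such as 32 or 64 are out of reach for this loop, so it only shows the principle. All function names are hypothetical.

```python
from itertools import product
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def hits_and_coverage(seed: str, x: tuple[int, ...]) -> tuple[int, int]:
    """Number of seed hits on x and number of positions covered by these hits."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    hits, covered = 0, set()
    for p in range(len(x) - len(seed) + 1):
        if all(x[p + i] for i in ones):
            hits += 1
            covered.update(p + i for i in ones)
    return hits, len(covered)


def correlations(seed: str, length: int) -> tuple[float, float]:
    """Correlation of identity with (multi-hit value, coverage value) over all alignments."""
    identity, multihit, cover = [], [], []
    for x in product((0, 1), repeat=length):
        h, c = hits_and_coverage(seed, x)
        identity.append(sum(x) / length)
        multihit.append(h)
        cover.append(c)
    return pearson(identity, multihit), pearson(identity, cover)


r_hit, r_cov = correlations("11*1", 10)
print(f"multi-hit: {r_hit:.3f}  coverage: {r_cov:.3f}")
```

Running correlations for each enumerated seed gives one (x, y) point per seed, which is the layout the slide describes for the next plot.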
Alignment-free distances
Fixed alignment length and Variable Minimal % id: http://www.youtube.com/watch?v=YfQcF_GJ1jM
Variable alignment length and Variable Minimal % id: http://www.youtube.com/watch?v=LDenQv6HlEM
Conclusion
A coverage criterion for spaced seeds
its applications to SVM string-kernels and k-mer distances
[Plot: True distance correlation with MultiHit (x) vs Coverage (y) distance. x-axis: correlation multihit distance / true distance; y-axis: correlation coverage distance / true distance; reference line x = y; series: single seed (id ≥ 00.0%), double seed (id ≥ 00.0%)]
Perspectives
Automaton size / building ways / generating function
Guessing most likely matches/mismatches distribution?
ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11
Phylogenetic studies?
References I
Bao, E., Jiang, T., Kaloshian, I., and Girke, T. (2011). SEED: efficient clustering of next-generation sequences. Bioinformatics, 27(18):2502–2509.
Benson, G. and Mak, D. Y. (2008). Exact distribution of a spaced seed statistic for DNA homology detection. In Proceedings of the International Symposium on String Processing and Information Retrieval (SPIRE), volume 5280 of LNCS, pages 282–293.
Boden, M., Schoneich, M., Horwege, S., Lindner, S., Leimeister, C., and Morgenstern, B. (2013). Alignment-free sequence comparison with spaced k-mers. In Proceedings of the German Conference on Bioinformatics (GCB), volume 34 of OpenAccess Series in Informatics (OASIcs), pages 24–34.
References II
Chong, Z., Ruan, J., and Wu, C.-I. (2012). Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics, 28(21):2732–2737.
Ghandi, M., Lee, D., Mohammad-Noori, M., and Beer, M. A. (2014). Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Computational Biology, 10(7):e1003711.
Hauser, M., Mayer, C. E., and Soding, J. (2013). kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics, 14(248).
References III
Horwege, S., Lindner, S., Boden, M., Hatje, K., Kollmar, M., Leimeister, C.-A., and Morgenstern, B. (2014). Spaced words and kmacs: Fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Research, 42(W1):W7–W11.
Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., and Morgenstern, B. (2014). Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30(14):1991–1999.
Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In JSM Proceedings, Statistical Computing Section, Alexandria, Virginia. American Statistical Association.
References IV
Martin, D. E. K. and Noe, L. (2014). Faster exact probabilities for statistics of overlapping pattern occurrences. Submitted to the Ann. I. Stat. Math.
Onodera, T. and Shibuya, T. (2013). The gapped spectrum kernel for support vector machines. In Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), volume 7988 of LNCS, pages 1–15.