A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances

Laurent Noé, Donald E. K. Martin

LIFL (UMR 8022 Lille 1/CNRS) - Inria Lille, Villeneuve d’Ascq, France
Department of Statistics, North Carolina State University, Raleigh, NC, USA

SeqBio 2014

November 4&5, 2014 - Montpellier

Outline

1 Introduction to spaced seeds . . .

2 Spaced seed coverage

    Definition
    Associated automaton
    Possible use (as a seed “quality” measure)

3 Experimental results

    SVM classifiers
    Alignment-free distances

4 Conclusion

Spaced Seeds (PatternHunter 02, Burkhardt et al 01, . . . )

Definition

A spaced seed π is defined as a binary word over the alphabet {1, *}:

1 : accepts only the match symbol |,

* : accepts all alignment symbols (joker).

s : span (length), w : weight (number of 1 symbols).

Example

π = 111*1*11

ATCAGTGCGAATGCGCAAGA
|||||:||:|||||.|||||
ATCAGCGCAAATGCTCAAGA
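The definition above can be sketched in a few lines of Python. This is only an illustration, assuming the alignment is given as its middle line of symbols and that anything other than | counts as a mismatch. The names span_and_weight and hits_at, and the 0-based offsets, are hypothetical and do not come from the talk.

```python
def span_and_weight(seed: str) -> tuple[int, int]:
    """Return (span, weight) of a seed written over the alphabet {1, *}."""
    if not seed or set(seed) - {"1", "*"}:
        raise ValueError("a seed is a non-empty word over {1, *}")
    return len(seed), seed.count("1")


def hits_at(seed: str, alignment: str, offset: int) -> bool:
    """True if every '1' of the seed faces a match symbol '|' at this offset."""
    if offset < 0 or offset + len(seed) > len(alignment):
        return False
    return all(
        alignment[offset + i] == "|"
        for i, symbol in enumerate(seed)
        if symbol == "1"
    )


seed = "111*1*11"
alignment = "|||||:||:|||||.|||||"  # middle line of the example alignment
print(span_and_weight(seed))        # (8, 6)
print(hits_at(seed, alignment, 0))  # True: the mismatch at offset 5 falls under a joker
print(hits_at(seed, alignment, 1))  # False: a 1 of the seed faces the mismatch at offset 5
```

At offset 0 the two jokers sit over offsets 3 and 5, so the mismatch at offset 5 does not prevent the hit.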


Example

ATCAGTGCAAATGCTCAAGA
||||||||||||||||||||
ATCAGTGCAAATGCTCAAGA

[Figure: the contiguous seed 111111 and the spaced seed 111*1*11, each drawn at successive offsets along this alignment.]


Example

ATCAGTGCAAATGCTCAAGA
|||||.||||||||||||||
ATCAGCGCAAATGCTCAAGA

[Figure: the contiguous seed 111111 and the spaced seed 111*1*11, each drawn at successive offsets along this alignment.]


Example

ATCAGTGCAAATGCGCAAGA
|||||.||||||||.|||||
ATCAGCGCAAATGCTCAAGA

[Figure: the contiguous seed 111111 and the spaced seed 111*1*11, each drawn at successive offsets along this alignment.]


Example

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA

[Figure: the contiguous seed 111111 and the spaced seed 111*1*11, each drawn at successive offsets along this alignment.]

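The comparison made by the Example slides can be reproduced with a short Python sketch. It assumes that a hit means every 1 of the seed faces a | in the middle line of the alignment, and it simply tries every offset. The helper name hit_offsets and the 0-based offsets are illustrative choices, not taken from the talk.

```python
def hit_offsets(seed: str, alignment: str) -> list[int]:
    """Offsets at which every '1' of the seed faces a match symbol '|'."""
    ones = [i for i, symbol in enumerate(seed) if symbol == "1"]
    last = len(alignment) - len(seed)
    return [
        p for p in range(last + 1)
        if all(alignment[p + i] == "|" for i in ones)
    ]


# Middle lines of the four example alignments (0, 1, 2 and 3 mismatches).
alignments = [
    "||||||||||||||||||||",
    "|||||.||||||||||||||",
    "|||||.||||||||.|||||",
    "|||||.||.|||||.|||||",
]

for line in alignments:
    contiguous = hit_offsets("111111", line)
    spaced = hit_offsets("111*1*11", line)
    print(line, len(contiguous), len(spaced))
```

On the last alignment the contiguous seed 111111 no longer hits anywhere, while the spaced seed 111*1*11 still hits at offsets 0, 9 and 11.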

Recent work related to spaced seeds

1 Alignment-free distances [Leimeister et al., 2014, Horwege et al., 2014, Boden et al., 2013]

2 SVM classification [Onodera and Shibuya, 2013, Ghandi et al., 2014]

3 Read clustering [Bao et al., 2011, Chong et al., 2012, Hauser et al., 2013]

4 Metagenomic classification, . . .

“New Uses for Old Things”

little boy ⇒ frying pan ¹

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11

⇒

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11

¹ http://arch5541.wordpress.com/2012/11/16/and-then-there-was-teflon/



Coverage measure for a seed

Definition
Number of match symbols covered by at least one 1 symbol from any seed hit [Benson and Mak, 2008, Martin, 2013]

Example
ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
••• • •• ••••• ••••
111*1*11
         111*1*11
           111*1*11

(• : position covered by a 1 symbol of at least one seed hit)

Coverage is 15
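The definition above can be checked on the slide's example with a minimal Python sketch. The function names `seed_hits` and `covered_positions` are illustrative choices, not names from the talk, and the alignment is reduced to a list of match/mismatch booleans.

```python
def seed_hits(seed, matches):
    """Start positions where every '1' of the seed lies on a match."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    last_start = len(matches) - len(seed)
    return [p for p in range(last_start + 1)
            if all(matches[p + i] for i in ones)]


def covered_positions(seed, matches):
    """Alignment positions under a '1' of at least one seed hit."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for p in seed_hits(seed, matches):
        covered.update(p + i for i in ones)
    return covered


top = "ATCAGTGCGAATGCGCAAGA"
bottom = "ATCAGCGCAAATGCTCAAGA"
# True where the two sequences agree (the '|' symbols of the slide).
matches = [a == b for a, b in zip(top, bottom)]

hits = seed_hits("111*1*11", matches)
coverage = len(covered_positions("111*1*11", matches))
print(hits, coverage)  # [0, 9, 11] 15
```

The three hits overlap on some positions, so the coverage (15) is smaller than the number of hits times the seed weight (3 × 6 = 18).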



Coverage measure for a seed / a set of seeds

alignment : x = 101111001011111

Example

seed : π = 11*1

π occ1         1 1 * 1
π occ2                         1 1 * 1
π occ3                           1 1 * 1
x        = 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
covered        • •   •         • • • • •

set of seeds : {π1, π2} = {11*1, 1*1*1}

π2 occ1    1 * 1 * 1
π1 occ2        1 1 * 1
π2 occ3                    1 * 1 * 1
π1 occ4                        1 1 * 1
π2 occ5                        1 * 1 * 1
π1 occ6                          1 1 * 1
x        = 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
covered    •   • • • •     •   • • • • •
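The same counting extends from one seed to a set of seeds by taking the union of the positions covered by every occurrence of every seed. The sketch below is illustrative only: the names `occurrences` and `set_coverage` and the labels `pi`, `pi1`, `pi2` are assumptions. Listing occurrences by end position is a choice made here because it reproduces the occ1 … occ6 numbering of the slide.

```python
def occurrences(seeds, x):
    """(end, start, name) of every hit of every seed in the 0/1 string x."""
    found = []
    for name, seed in seeds.items():
        ones = [i for i, c in enumerate(seed) if c == "1"]
        for start in range(len(x) - len(seed) + 1):
            if all(x[start + i] == "1" for i in ones):
                found.append((start + len(seed) - 1, start, name))
    return sorted(found)  # ordered by end position, then start


def set_coverage(seeds, x):
    """Positions of x under a '1' of at least one occurrence of any seed."""
    covered = set()
    for _end, start, name in occurrences(seeds, x):
        covered.update(start + i
                       for i, c in enumerate(seeds[name]) if c == "1")
    return covered


x = "101111001011111"
single = {"pi": "11*1"}
double = {"pi1": "11*1", "pi2": "1*1*1"}

order = [name for _end, _start, name in occurrences(double, x)]
print(order)                          # ['pi2', 'pi1', 'pi2', 'pi1', 'pi2', 'pi1']
print(len(set_coverage(single, x)))   # 8
print(len(set_coverage(double, x)))   # 11
```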


Coverage measure for a seed / a set of seeds

{π1, π2} = {11*1, 1*1*1}


Coverage measure for a seed / a set of seeds

That’s how coverage can be measured, estimated, computed on several models...

But... is coverage useful?
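As one illustration of computing coverage on a model, the sketch below enumerates every alignment of a small length under an i.i.d. Bernoulli model (each position is a match with probability `p_match`) and accumulates the exact distribution of the coverage. The Bernoulli model, the brute-force enumeration and all names are assumptions made for this example. It does not use the automaton construction of the talk and only scales to short alignments.

```python
from itertools import product


def coverage_value(seed, x):
    """Number of positions of x under a '1' of at least one seed hit."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for p in range(len(x) - len(seed) + 1):
        if all(x[p + i] for i in ones):
            covered.update(p + i for i in ones)
    return len(covered)


def coverage_distribution(seed, length, p_match):
    """Map coverage value -> probability, by exhaustive enumeration."""
    dist = {}
    for x in product((0, 1), repeat=length):
        k = sum(x)  # number of matches in this alignment
        prob = p_match ** k * (1 - p_match) ** (length - k)
        c = coverage_value(seed, x)
        dist[c] = dist.get(c, 0.0) + prob
    return dist


dist = coverage_distribution("11*1", 10, 0.7)
# At least one hit is equivalent to a non-zero coverage.
single_hit_sensitivity = 1.0 - dist.get(0, 0.0)
# Coverage criterion: probability that the coverage reaches n (here n = 6).
coverage_sensitivity = sum(p for c, p in dist.items() if c >= 6)
print(single_hit_sensitivity, coverage_sensitivity)
```

Since a single hit of a weight-3 seed already covers three positions, the coverage values 1 and 2 never occur in the resulting distribution.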


Experimental results

1 SVM classifiers

2 Alignment-free distances


SVM classifiers

Are spaced seeds better with string kernel classifiers?

Yes: see [Onodera and Shibuya, 2013, Ghandi et al., 2014]

Which spaced seed patterns are better? Does coverage help here?


SVM classifiers

1 RFAM 11.0 database (50% training, 50% testing)

2 Single/double seeds of weight w = 3 ... 4, span up to w + 4

3 For each seed,

    Learn (classical string kernel with linear classifier).
    Measure the SVM zero/one error.

4 Compute the correlation coefficient between the SVM zero/one error and:

    the single hit criterion (at least one seed hit)
    the multi hit criterion (at least n seed hits)
    the coverage criterion (at least n seed coverage)
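The "Learn" step relies on a string kernel driven by the seed. The slides do not spell the kernel out, so the following is only a sketch of one common variant, assumed here for illustration: each sequence is mapped to the counts of the words read at the 1 positions of the seed, and the kernel is the dot product of two count vectors. The names `spaced_word_counts` and `spaced_kernel` are hypothetical.

```python
from collections import Counter


def spaced_word_counts(sequence, seed):
    """Count the words read through the '1' positions of the seed."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    counts = Counter()
    for p in range(len(sequence) - len(seed) + 1):
        counts["".join(sequence[p + i] for i in ones)] += 1
    return counts


def spaced_kernel(seq_a, seq_b, seed):
    """Dot product of the two spaced-word count vectors."""
    a = spaced_word_counts(seq_a, seed)
    b = spaced_word_counts(seq_b, seed)
    # A Counter returns 0 for words that are absent from seq_b.
    return sum(n * b[word] for word, n in a.items())


k_same = spaced_kernel("ACGT", "ACGT", "1*1")
k_cross = spaced_kernel("ATCAGTGCGAATGCGCAAGA",
                        "ATCAGCGCAAATGCTCAAGA", "11*1")
print(spaced_word_counts("ACGT", "1*1"), k_same, k_cross)
```

A matrix of such kernel values over the training set could then be handed to any SVM implementation that accepts a precomputed kernel. Which implementation the authors used is not stated in the slides.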


3 SVM zero/one error

[Figure: SVM zero/one error (x-axis, ticks 52 to 66, labelled “Good” and “Bad”) of each weight 3 single seed: 111, 1*11, 11*1, 1**11, 1*1*1, 11**1, 1***11, 1**1*1, 1*1**1, 11***1, 1****11, 1***1*1, 1**1**1, 1*1***1, 11****1]

4 SVM zero/one error vs multi hit sensitivity for n = 5

[Figure: multi hit sensitivity (y-axis, ticks 0.89 to 0.94) against SVM zero/one error (x-axis, ticks 52 to 66) for the same weight 3 single seeds]

4 SVM zero/one error vs sensitivity (3 criteria)
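The weight 3 single seeds appearing in the first plot can be regenerated with a short enumeration. The sketch assumes that a seed always starts and ends with a 1, which is consistent with the seeds listed for the plot but is not stated explicitly in the slides. The name `single_seeds` and its parameters are illustrative.

```python
from itertools import combinations


def single_seeds(weight, max_extra=4):
    """All seeds of the given weight (>= 2) and span weight .. weight + max_extra.

    Assumption: the first and last symbols of a seed are both '1'.
    """
    seeds = []
    for span in range(weight, weight + max_extra + 1):
        # Place the remaining weight - 2 ones strictly inside the seed.
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if i in ones else "*"
                                 for i in range(span)))
    return seeds


weight3 = single_seeds(3)
print(len(weight3))  # 15
print(weight3)
```

For weight 3 and span up to w + 4 = 7 this gives 1 + 2 + 3 + 4 + 5 = 15 seeds, the same 15 patterns that label the plot.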

Alignment-free distances

Are spaced seeds better in estimating the “true” alignment distance?

Yes: see [Leimeister et al., 2014, Horwege et al., 2014, Boden et al., 2013] ...

Which spaced seed patterns are better? Does coverage help here?


Alignment-free distances

1 Alignment length (e.g. l = 16, 32, 64)

2 Single/double seeds of weight w = 2 ... 9, span up to w + 4

3 For each seed,

    Generate every possible alignment of length l and measure its percentage of identity.
    Compute the correlation coefficient between the true percentage of identity of each alignment and

        the multi-hit value of the seed (next plot: x-axis)
        the coverage value of the seed (next plot: y-axis)

    on this alignment.
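The protocol above can be sketched for a single seed as follows. This is an illustration under stated assumptions, not the authors' code: all 2^l alignments are weighted equally, the length is kept small (l = 12) so that the enumeration stays fast, no minimal percentage of identity is imposed, and the Pearson coefficient is used as the correlation coefficient. All function names are hypothetical.

```python
from itertools import product
from math import sqrt


def hits_and_coverage(seed, x):
    """Number of seed hits on x and number of positions they cover."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    hits, covered = 0, set()
    for p in range(len(x) - len(seed) + 1):
        if all(x[p + i] for i in ones):
            hits += 1
            covered.update(p + i for i in ones)
    return hits, len(covered)


def pearson(u, v):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (dev_u * dev_v)


def correlations(seed, length):
    """Correlation of identity with (multi-hit value, coverage value)."""
    identity, multihit, coverage = [], [], []
    for x in product((0, 1), repeat=length):
        h, c = hits_and_coverage(seed, x)
        identity.append(sum(x) / length)
        multihit.append(h)
        coverage.append(c)
    return pearson(identity, multihit), pearson(identity, coverage)


r_multihit, r_coverage = correlations("11*1", 12)
print(r_multihit, r_coverage)
```

Each seed then yields one point (r_multihit, r_coverage), which is how the x and y coordinates of the next plot are described in the list above.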


Alignment-free distances

Fixed alignment length and variable minimal % id: http://www.youtube.com/watch?v=YfQcF_GJ1jM

Variable alignment length and variable minimal % id: http://www.youtube.com/watch?v=LDenQv6HlEM


Conclusion

A coverage criterion for spaced seeds

its applications to SVM string-kernels and k-mer distances

[Figure: “True distance: correlation with MultiHit (x) vs Coverage (y) distance”. x-axis: correlation multihit distance / true distance; y-axis: correlation coverage distance / true distance; reference line x = y; series: single seed (id ≥ 00.0%), double seed (id ≥ 00.0%)]


Perspectives

Automaton size / building ways / generating function

Guessing most likely matches/mismatches distribution?

ATCAGTGCGAATGCGCAAGA
|||||.||.|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11

Phylogenetic studies?


References I

Bao, E., Jiang, T., Kaloshian, I., and Girke, T. (2011). SEED: efficient clustering of next-generation sequences. Bioinformatics, 27(18):2502–2509.

Benson, G. and Mak, D. Y. (2008). Exact distribution of a spaced seed statistic for DNA homology detection. In Proceedings of the International Symposium on String Processing and Information Retrieval (SPIRE), volume 5280 of LNCS, pages 282–293.

Boden, M., Schoneich, M., Horwege, S., Lindner, S., Leimeister, C., and Morgenstern, B. (2013). Alignment-free sequence comparison with spaced k-mers. In Proceedings of the German Conference on Bioinformatics (GCB), volume 34 of OpenAccess Series in Informatics (OASIcs), pages 24–34.


References II

Chong, Z., Ruan, J., and Wu, C.-I. (2012). Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics, 28(21):2732–2737.

Ghandi, M., Lee, D., Mohammad-Noori, M., and Beer, M. A. (2014). Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Computational Biology, 10(7):e1003711.

Hauser, M., Mayer, C. E., and Soding, J. (2013). kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics, 14(248).


References III

Horwege, S., Lindner, S., Boden, M., Hatje, K., Kollmar, M., Leimeister, C.-A., and Morgenstern, B. (2014). Spaced words and kmacs: Fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Research, 42(W1):W7–W11.

Leimeister, C.-A., Boden, M., Horwege, S., Lindner, S., and Morgenstern, B. (2014). Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30(14):1991–1999.

Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In JSM Proceedings, Statistical Computing Section, Alexandria, Virginia. American Statistical Association.


References IV

Martin, D. E. K. and Noe, L. (2014). Faster exact probabilities for statistics of overlapping pattern occurrences. Submitted to the Ann. I. Stat. Math.

Onodera, T. and Shibuya, T. (2013). The gapped spectrum kernel for support vector machines. In Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), volume 7988 of LNCS, pages 1–15.
