BLAST: A Case Study Lecture 25

BLAST: Introduction

The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

BLAST was developed to find sequences of nucleotides or amino acids in a database that match a query sequence.

For example, searching the human genome for

AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT

produces a list of sequences scored by similarity.

This system helps scientists find genetic homologues across individuals and species.

Using BLAST

There are several interfaces to BLAST, and it often appears as one component of a larger suite of informatics tools.

The National Center for Biotechnology Information (NCBI) hosts the primary website and a server farm dedicated to BLAST.

From here, a user

• enters a query,

• selects a database,

• chooses a variant of BLAST to use, and

• sets program parameters.

Results appear in seconds.

BLAST Results

The NCBI BLAST tool returns results in several modes, with information centered around similarity scores.

In addition to a list of matches, the tool returns

• a graphical view of the list that visualizes the alignments,

• a detailed textual view of each match, and

• a mapping of the matches to a visual representation of an entire genome.

How BLAST Works (Stage 1)

The core BLAST algorithm has three distinct stages.

In the first stage, the system splits the query sequence into overlapping words of a fixed length, W.

Assuming the constant, W, is 4, the nucleotide query

AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT

produces the words

AGCT GCTT CTTT … GCCT CCTT CTTT

BLAST scores each of these words against every possible four-letter word over the sequence alphabet (4^4 = 256 words for nucleotides).

Only the words whose similarity scores exceed a threshold move on to the later stages; the rest are discarded.
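
As a rough illustration of this first stage, the Python sketch below splits the query into overlapping words and then keeps every four-letter word that scores well against at least one of them. The function names, the +1/-1 match and mismatch values, and the threshold of 2 are assumptions made for the example, not BLAST's actual defaults.

```python
from itertools import product

ALPHABET = "ACGT"  # nucleotide alphabet; a protein search would use 20 letters


def split_into_words(query, w):
    """Return every overlapping length-w substring of the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(a, b, match=1, mismatch=-1):
    """Score two equal-length words position by position (illustrative values)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(query, w, threshold):
    """Map each high-scoring candidate word to the query offsets it resembles."""
    seeds = {}
    candidates = ["".join(p) for p in product(ALPHABET, repeat=w)]
    for offset, word in enumerate(split_into_words(query, w)):
        for candidate in candidates:
            # Assumption: a candidate is kept when its score is at least the threshold.
            if word_score(word, candidate) >= threshold:
                seeds.setdefault(candidate, []).append(offset)
    return seeds


query = "AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT"
words = split_into_words(query, 4)
seeds = neighborhood(query, 4, threshold=2)
```

With these assumed values, a candidate survives only if it differs from some query word in at most one position. The resulting dictionary is what the later stages would look up in the database.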

Side Note: Similarity in BLAST

To score the similarity of two words, BLAST builds a table based on edit distances.

For example, if each matching position counts as one point, comparing AGCT to ACCC gives a score of 2, whereas comparing it to GGCT gives 3.

However, some substitutions (due to mutation) are more likely than others, especially in the case of amino acids.

BLAST accepts a scoring matrix for protein strings (e.g., PAM70, a Point Accepted Mutation matrix).

For nucleotide strings, users can specify distinct scores for matches and mismatches.

BLAST also includes procedures for identifying and penalizing gaps.
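
The scoring ideas on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the toy substitution values (which only encode the idea that some substitutions are penalized less than others), and the gap penalties. None of these numbers come from BLAST or from a published matrix.

```python
# Transitions (A<->G, C<->T) are treated as milder than other substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def score_ungapped(a, b, match=1, mismatch=0):
    """Score two equal-length words with user-chosen match/mismatch values."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def toy_substitution_score(x, y):
    """Toy substitution table: made-up values, not a real scoring matrix."""
    if x == y:
        return 2
    return -1 if (x, y) in TRANSITIONS else -2


def score_alignment(row_a, row_b, substitution, gap_open=-3, gap_extend=-1):
    """Score two aligned rows in which '-' marks a gap.

    The first position of a gap run costs gap_open and each further position
    costs gap_extend. Simplification: adjacent gaps in either row count as
    one run.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += substitution(x, y)
            in_gap = False
    return total


simple = score_ungapped("AGCT", "GGCT")
gapped = score_alignment("AGCTTT", "A--TTT", toy_substitution_score)
```

The scores depend entirely on the values chosen, which is why BLAST exposes them as parameters.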

How BLAST Works (Stages 2 and 3)

At this point, BLAST has built a set of W-length words whose similarity scores exceed a user-provided threshold.

During the second stage, the system searches for all occurrences of these words within the database.

In the third stage, BLAST extends each of these W-length matches to get the final similarity score.

The system also calculates the E-value for the score, which is a statistical measure of significance.
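
A minimal Python sketch of stages 2 and 3 follows, under several simplifying assumptions: seeds are exact query words only, extension is ungapped, and extension stops once the running score falls a fixed amount (x_drop) below the best score seen. The function names and parameter values are hypothetical. The E-value function uses the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths; K and lambda depend on the scoring scheme, so any values passed in are placeholders.

```python
import math


def exact_seeds(query, w):
    """Simplified seed table: each query word mapped to its query offsets."""
    seeds = {}
    for i in range(len(query) - w + 1):
        seeds.setdefault(query[i:i + w], []).append(i)
    return seeds


def find_hits(seeds, subject, w):
    """Stage 2: list (query_offset, subject_offset) for every seed occurrence."""
    hits = []
    for s in range(len(subject) - w + 1):
        for q in seeds.get(subject[s:s + w], ()):
            hits.append((q, s))
    return hits


def _extend_one_way(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk in one direction; return (best gain, positions kept)."""
    best = running = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += match if query[q] == subject[s] else mismatch
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running >= x_drop:
            break  # score has dropped too far below the best seen
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q, s, w, match=1, mismatch=-1, x_drop=3):
    """Stage 3: extend a seed both ways.

    Returns (score, query_start, subject_start, length).
    """
    seed = sum(match if query[q + i] == subject[s + i] else mismatch
               for i in range(w))
    right, right_len = _extend_one_way(query, subject, q + w, s + w, 1,
                                       match, mismatch, x_drop)
    left, left_len = _extend_one_way(query, subject, q - 1, s - 1, -1,
                                     match, mismatch, x_drop)
    return seed + left + right, q - left_len, s - left_len, left_len + w + right_len


def e_value(score, m, n, k, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


query = "AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT"
subject = "GGGGCTGTCAACCCCAGGGG"  # made-up database sequence
hits = find_hits(exact_seeds(query, 4), subject, 4)
result = extend_hit(query, subject, 14, 6, 4)
```

Here the seed GTCA at query offset 14 and subject offset 6 grows into a 12-letter exact match. A lower E-value means the score is less likely to have arisen by chance.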

Knowledge and Search in BLAST

BLAST differs from many of the informatics tools that we have considered in the course.

Essentially it finds a sequence’s nearest neighbors within a database with minimal concern for the content.

Unlike discovery or analysis tools, BLAST gathers information and leaves the interpretation to the user.

However, like many discovery tools, BLAST relies on domain knowledge to carry out heuristic search.

Knowledge: match/mismatch costs for amino acid and nucleotide sequences

Heuristic Search: an approximate scoring scheme tells BLAST where to look more closely

What Makes BLAST a Successful Tool?

Google Scholar identifies over 28,000 citations of the original BLAST paper.

One of the key reasons for the system’s popularity is that it addresses problems commonly encountered in biology:

• finding genetic homologues across organisms; and

• determining the source organism of a sequenced genome (e.g., the Global Ocean Sampling Expedition).

Technical issues also contributed to BLAST’s success:

• it was much faster than competing software;

• it was distributed and maintained by the National Institutes of Health;

• it has continually evolved to meet new challenges and to integrate with new databases and other technologies.

BLAST: Summary

A key insight in BLAST was to iteratively refine a solution:

• find a reduced set of short words to use as a heuristic for locating similar strings;

• find matches to those short words and extend them to refine the candidate solution.

This strategy accounts for the computational gains that this system makes over others that seek exact comparisons.

The continued success of BLAST is attributable to

• the speed with which it can find sequence matches,

• its availability over the internet,

• its integration with other biological tools, and

• the fact that it addresses a specific need of biologists.