Source: biochemistry.okstate.edu/core-facilities-1/osu-dna-protein-core...

Analyzing Sequencing Electropherograms

-by Scott Nicholson-

Your sequence data arrives in two separate file formats: a .txt file and an .ab1 file. The .txt

file is just the sequence; the .ab1 file is the electropherogram. It’s tempting to just copy and

BLAST the sequence from the .txt file and go. But you should always look at your sequencing

electropherograms. These will help you get a handle on the quality and reliability of your

sequence, and will reveal any problems in template purification or primer specificity. They will

also display any evidence of sequence heterozygosity or divergence. In cases of sequencing

failure, the electropherogram can guide you to success.

What are we looking at in the sequencing electropherogram?

The sequencing electropherogram is a chart of light emission over time. Your sample is processed

in a series of Sanger reactions that generate a pool of labelled fragments ranging in size from the

primer plus one base to however many bases are in your template (and occasionally beyond –

addressed below). The fragments are labelled with one of four dyes attached to a modified

nucleotide base. Look up Sanger Sequencing to find out more.

These fragments are run through a capillary. When they reach the end of the capillary, they’re

blasted with a laser. The dyes emit at different wavelengths, and the intensity of each of these

wavelengths is measured throughout the run at a frequency of about 3 Hz.

Now let’s explore how to interpret electropherograms and how they can be used

to solve your sequencing problems. We’ll begin with an example of what you would really like to

see.

An Excellent Template:

Let’s start with an excellent template and primer run in an optimal reaction. The beginning,

middle, and end of the sequence are shown below:

Beginning:

In the beginning, things are kind of rough. We’re talking about very small fragments that tend to

produce messy peaks, so sequence before about base 30 is usually not reliable. From there on, it’s very

good. Look at the gray bars above the sequence identities – anything above the green dashed

line has a Q value over 20, which is considered very good (>99% probability of correct call). Note

the tall, narrow peaks – that’s what you want.
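
If you want to put numbers on those gray bars, here is a minimal sketch. It assumes the Q values shown are standard Phred scores (an assumption; this guide does not name the scale the basecaller uses), and the function names are illustrative only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred quality score corresponding to a base-call error probability p."""
    return -10 * math.log10(p)


# Print the error probability and accuracy for a few common Q values.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Under that assumption, Q20 corresponds to a 1-in-100 chance of a wrong call, which matches the ">99% probability of correct call" rule for bars above the dashed line.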

Middle:

When you get around 600 bases out, peaks start to decline in height a bit, but still produce very

reliable sequence data. Since we’re using dye terminator technology, each fragment ends where

a dye-labelled terminator was incorporated. The longer a fragment is, the fewer copies of it there

are going to be, so the peaks get smaller as the fragments get bigger.

End:

With an exceptional template, you can get up to 1,000 reliable bases and slightly beyond. Notice

here that peak quality begins to degrade after about 900 (the peaks become broad and jagged),

and the base Q value drops below acceptability after 1000. Anything beyond that should be

ignored; it’s low-confidence data.
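
The idea of ignoring the rough start and the ragged tail can be sketched in code. This is not the basecaller’s or the facility’s procedure, just one simple way to do it under stated assumptions: you have a list of per-base Q values, and the threshold of 20 and the 10-base window are illustrative choices. The quality profile below is made up.

```python
def trim_by_quality(quals, threshold=20, window=10):
    """Return (start, end) indices of the read left after trimming both ends.

    A window slides along the read; the kept region runs from the first
    window whose mean Q reaches the threshold to the end of the last such
    window. Returns (0, 0) when no window passes.
    """
    last = len(quals) - window
    if last < 0:
        return (0, 0)

    def window_passes(i):
        return sum(quals[i:i + window]) / window >= threshold

    passing = [i for i in range(last + 1) if window_passes(i)]
    if not passing:
        return (0, 0)
    return (passing[0], passing[-1] + window)


# Made-up quality profile: rough start, long clean middle, ragged tail.
quals = [5] * 30 + [40] * 900 + [8] * 70
start, end = trim_by_quality(quals)
print(f"keep bases {start + 1} to {end} of {len(quals)}")
```

Because the test is on a window mean, a few low-quality bases can survive at each edge; a stricter threshold or a shorter window tightens the cut.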

Let’s also look at the raw data from this run:

The Y-axis is intensity, and the X-axis is time expressed in scan numbers (the instrument scans

about 3 times per second). The red horizontal curved line is just a hand-drawn general trend

line to demonstrate the relative shape of a good capillary trace. With most any protocol, it

takes about 2400 scans for the initial products to travel down the capillaries to the detector.

Then data is constantly recorded as each reaction product passes across the detector. You can

see that the peaks are still pretty high (about 2000-2500) at the end of the run, and there were

probably quite a few even longer fragments left when recording stopped. These fragments are

not worth reading, however, because their peaks are too broad to produce reliable basecalling

(the peak resolution with larger fragments is very low – you can see this by comparing peaks in

scans 2500-15500 with the peaks afterward – peak broadening is very noticeable after about

15500).
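
Scan numbers are easier to picture as run time. A minimal sketch of the conversion, assuming the approximate rate of 3 scans per second quoted above holds for the whole run (the function name is illustrative):

```python
SCANS_PER_SECOND = 3.0  # approximate scan rate quoted in this guide


def scan_to_minutes(scan_number, scans_per_second=SCANS_PER_SECOND):
    """Convert a raw-data scan number to elapsed run time in minutes."""
    return scan_number / scans_per_second / 60


# Scan numbers mentioned in the text: first products reach the detector
# at about 2400 scans; peak broadening is obvious after about 15500 scans.
for scan in (2400, 15500):
    print(f"scan {scan}: about {scan_to_minutes(scan):.0f} min into the run")
```

So the first products show up roughly a quarter of an hour into the run, and peak broadening sets in close to the hour-and-a-half mark.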

Sequencing Problems We Have Seen, and their solutions (if any).

DNA contamination

This is a mess, and none of the bases called are reliable. What you have is a combination of

sample contamination, mixed templates (there are multiple primer-binding sites in the sample),

and low DNA concentration. Note that dye blobs are also present in this sample (the large broad

peaks around 80 and 110 - explained below). This sample somehow retained dye blobs through

multiple purifications, which is excellent evidence of some sort of severe contamination. The

reason for this was not determined, but switching sample prep methods eliminated this artefact

later on. Taking a look at the raw data for this run (in FinchTV, go to view > raw data), you see

this:

The Y-axis is intensity, and the X-axis is time expressed in scan numbers (the instrument scans

about 3 times per second). The peak intensity averages about 500 – 1000, which is generally ok,

but far from ideal (see the first example of good raw data above for comparison). The relatively

low peak intensity is a result of low DNA concentration combined with sample contamination.

Lower peak intensities than this can still provide excellent sequence data, but not here. The

problem is that too many peaks are superimposed on one another and the

contamination likely interfered with the migration patterns; the result is a mess.

No set of peaks is clearly separated from another, and the raw data does not present an orderly

progression of peaks for the software to process. The cure for this is to generate more of your

template, gel purify, and make sure you remove any trace of alcohol, EDTA, or other reaction

inhibitors from your final prep.

Very Low DNA Concentration

This template showed dye blobs and also had very low base quality, but did extend out past 500

bases. Notice the very noisy baseline and erratic peak height. Let’s look at the raw data:

The most apparent peaks here are the dye blobs; they range up to 2500 intensity units. The

sequence we’re seeing is so faint that it just barely rises above the baseline noise. That’s why

the sequence quality is so low, and also why it looks so noisy. The verdict here is low DNA

concentration. Take your sample to the speed-vac, or try the prep again (the better choice). Also

remember, quantification readings below about 20 ng/ul should be viewed with suspicion. As a

test, load some of your blank on your NanoDrop and zap it a few times. What sort of readings do

you get, and how much do they fluctuate? If your readings jump from 0 to 11 to -7 ng/ul, your

error is about + or – 9 ng/ul (half the 18 ng/ul spread). Watch out for stuff like that. It’s always better to dilute from a high

concentration.
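
The blank test above boils down to a small piece of arithmetic. Here is a minimal sketch, assuming the error is taken as half the spread of the repeated blank readings, as in the example; the helper names are illustrative.

```python
def blank_half_range(readings):
    """Half the spread of repeated blank readings, in the same units."""
    return (max(readings) - min(readings)) / 2


def relative_uncertainty(sample_reading, noise):
    """Blank noise expressed as a fraction of the sample reading."""
    return noise / sample_reading


blank = [0, 11, -7]              # ng/ul, the readings from the example
noise = blank_half_range(blank)  # (11 - (-7)) / 2 = 9.0 ng/ul

# Compare a reading near the 20 ng/ul warning level with a much higher one.
for reading in (20, 200):
    frac = relative_uncertainty(reading, noise)
    print(f"{reading} ng/ul reading: +/- {noise:g} ng/ul is {frac:.0%} of the value")
```

The same 9 ng/ul of noise is almost half of a 20 ng/ul reading but only a few percent of a 200 ng/ul reading, which is why diluting from a high concentration is the safer route.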

No Template:

Example A: Seeing Nothing at All

Example A: Raw Data

There’s an old song about the one you love saying nothing at all, and it’s very sad. Seeing nothing

at all on your electropherogram is also very sad. We see nothing but random noise here, and the

peaks are very short. A look at the raw data shows only random noise that varies plus or minus

about 50 intensity units from 0 (Note the y-axis scale – this baseline is magnified compared to

other raw data plots, as there were no peaks at all). There’s no labelled product at all, because

there’s no DNA. A quantitation of the sample showed nothing. This customer re-prepped the

sample and it turned out fine, once there was actually DNA in the sample.

Example B: Nothing at all combined with primer self-complementarity

Here’s a classic example of the absence of template. The template was generated using the same

primers used to sequence, but there was no sequence data at all. There were fairly nice initial

peaks and nothing else. A look at the raw data showed:

What we’re seeing here is a case of primer self-complementarity: the primer had a little

self-homology and ran off a short copy of its 3’ end. The primer didn’t have an extremely high Tm,

and the Tm of the primer’s secondary structure was well below the annealing temperature of the

reaction. Further, the customer was able to generate a band (although a faint and fuzzy one)

using PCR with those same primers and template. What’s the deal here? We quantified the DNA,

and found that the sample had huge absorbance at 230 nm, and the only absorbance at 260 nm was part

of the shoulder of the 230 peak. What absorbs at 230? Guanidine HCl and phenol are a couple of examples,

both of which are components of many DNA isolation procedures. Here, a gel extraction procedure,

which has poor yield anyway, was done on the weak band, and only one wash was done. So,

most likely, some guanidine HCl but little to no DNA was eluted. Guanidine HCl denatures

protein, but there was relatively little in the sample, and the sample itself only made up 15% of

the final reaction mixture, so the enzyme was still active at some level. Since the primer didn’t

have any template to bind to, it bound to itself. The lesson here is to follow your kit instructions

precisely.

Too Much Template:

A condensed view – Note the quality bars at top. This was a very clean sample, and produced

very nice peaks to begin with. However, the sequence was only good out to about 600 in the

best cases, and only went out to about 350 most of the time. The DNA concentration was about

15X higher than we ask for. Reducing the DNA sample concentration lengthened the reads

considerably.

Too Little Template:

My template was 1 kb, but I’m only getting 300 bases! What gives?

Well, let’s look at the peak intensity in the raw data:

Notice how the peak intensities (neglecting initial peaks, they’re always stronger) start out at

about 200 and then decline to nothing? Compare that with the other examples in this guide.

Your sample concentration is too low. In this case, the DNA quality is excellent, so concentrating

the sample by speed-vac should work nicely. The concentration of this sample was less than 10%

of the required concentration.

Little to No Primer, or wrong primer:

This signal is 100% noise, with no amplification at all. There is a little unbound dye that got by

the column, but that’s all there is. Either the primer is absent or practically absent, or the primer

doesn’t bind at all to either the template or itself. In this case, the primer was over-diluted.

Remember, 5 pmol/uL = 5 nmol/mL = 5 umol/L = 5 uM. Your 20 uM PCR primer stock is 20

pmol/uL. Diluting it to ¼ strength (1 part stock plus 3 parts water) will bring your primers to 5 uM, the correct concentration.
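
The dilution arithmetic can be sketched with the usual C1 x V1 = C2 x V2 relationship. The 20 uL final volume is an arbitrary example, and the function name is illustrative. Since 1 uM equals 1 pmol/uL, the same numbers work in either unit.

```python
def dilution_volumes(stock_uM, target_uM, final_volume_uL):
    """Return (stock_uL, water_uL) needed to reach target_uM in final_volume_uL.

    Uses C1 * V1 = C2 * V2, so V1 = C2 * V2 / C1.
    """
    if target_uM > stock_uM:
        raise ValueError("cannot dilute to a concentration above the stock")
    stock_uL = target_uM * final_volume_uL / stock_uM
    return stock_uL, final_volume_uL - stock_uL


# 20 uM (20 pmol/uL) PCR primer stock diluted to 5 uM (5 pmol/uL).
stock_uL, water_uL = dilution_volumes(stock_uM=20, target_uM=5, final_volume_uL=20)
print(f"mix {stock_uL:g} uL stock with {water_uL:g} uL water")
```

For this example that is 5 uL of stock plus 15 uL of water, i.e. one part stock to three parts water.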

Unknown combination of template secondary structure, primer secondary structure, and

contamination:

Here we have the beginning of a sequence, but it declines rapidly and nothing can be determined.

This is probably due to sample contamination by inhibitors combined with secondary structure in the primer and template that

prevents elongation. Note the exponential decline in

peak height over a very short distance. Not good. Come see us and let’s see what else we can

do.

Very Short Templates:

Here we have a short sequence that amplified nicely, but was too short to call using the standard

base-calling settings – you’ll notice there is no sequence information above the peaks, but the

peaks are nicely formed. We will re-analyze this sort of sequence using an alternate base-calling

procedure.

High G percentage

Notice how the signal strength declines immediately once the enzyme hits the poly-G sequence.

This can be remedied by using our high-GC protocol.

High G percentage Combined with Heterozygosity:

Here we have a precipitous decline in peak height and quality (see gray bars above base identity)

due to the presence of a sequence consisting mostly of G. We have a high-GC protocol that

usually takes care of this, but other DNA quality and concentration issues can complicate the

issue even when using that protocol (like what happened here). Also notice the presence of

heterozygosity (red circles). This was a very difficult sequence, and this was the best it got.

Poly-T (or A) homopolymer:

Example 1

Example 2

Notice how the peaks are fine to begin with, but go crazy right after the poly-T region. This is due

to “polymerase slip”. The sequence afterward is pretty bad. That’s not that many T’s, and it

doesn’t seem fair, but we all know the answer to that complaint.

Sequence Data Continues After The End Of Your Template:

This happens occasionally. You see sequence data that follows your primary sequence, but

peak height is greatly diminished.

See those very short peaks past the natural end of the sequence? Let’s zoom in:

There it is: a small amount of contaminating sequence is producing peaks just tall enough to

produce a string of poor, unresolved data. Notice the Q value bars are all lousy. Now you know

that your prep contains an impurity. If it’s a plasmid prep, you picked up some

contamination somewhere (did you re-use a tip, perhaps?). If it’s a PCR prep, your primer most

likely binds poorly at another site. If you saw any additional bands on your gel, there you go. This

is not a problem as long as you delete the offending sequence. It can become a problem if the

contaminant sequence intensity is >10% of the primary template intensity (see Heterozygosity in

your Target Sequence below). Keep it clean.
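
To check a trace against the 10% rule of thumb above, you can compare typical peak heights in the two regions. This is only a sketch: the peak heights are made-up numbers, the use of the median as the "typical" height is an assumption, and the function name is illustrative.

```python
from statistics import median


def contaminant_fraction(primary_heights, contaminant_heights):
    """Typical contaminant peak height as a fraction of the primary peak height."""
    return median(contaminant_heights) / median(primary_heights)


# Hypothetical peak heights read off the raw data, in intensity units.
primary = [1200, 1500, 1350, 1100, 1400]   # within the real template
trailing = [60, 45, 80, 55, 70]            # past the end of the template

frac = contaminant_fraction(primary, trailing)
verdict = "likely a problem" if frac > 0.10 else "delete the trailing sequence and move on"
print(f"contaminant is {frac:.1%} of the primary signal: {verdict}")
```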

Bleed-Over from Very Strong Sample in Adjacent Capillary:

This happens rarely. You see very high quality sequence that follows your primary sequence, but

no peaks. What the What!!? But, if you increase the peak magnification, here’s what you’ll see:

Those teensy, tinsy little peaks are an actual sequence, but it’s not from your sample at all. It’s

the detector picking up an emission profile from a very intense sample running in an adjacent

capillary. This is only a problem if you include this sequence in your analysis. The practical

solution is identical to what you do in the case of a low-quantity contaminating sequence: Find

out where your sequence ends and delete the rest.

Heterozygosity in your Target Sequence:

Example 1 (below): Heterozygosity from one point that extends throughout the remaining

sequence.

Example 2 (below): Heterozygosity within a definite region that resolves after a number of bases

– Heterozygosity is confined to this region.

Example 3 (below): Single Nucleotide Polymorphism. Notice how the circled region contains two

separate peaks in the same spot. The taller peak is called (T), but you can see there is a C that is

nearly the same height, and in the same place. This means that you have two alleles, one

containing a T, and one substituting the T with a C (or vice versa) at the same location. Also notice

that the Q value for that base is low; this is because neither peak really stands out.

Everything’s great until you reach a certain point, then you see double peaks with low Q values.

This is a heterozygous region in your target. If your organism is diploid, tetraploid, etc., you’ll run

into this eventually. You’re copying each allele, and both copies are present in the sequencing

reaction in fairly equal numbers. The cure for this is to clone your PCR product into a PCR cloning

vector (TA cloning) and then sequence a few of those plasmids. Then you’ll see your one messed-

up sequence become two well-resolved, but different, sequences.
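
A double peak like the T/C position in Example 3 can be flagged with a simple height comparison. This is a sketch under stated assumptions, not what the basecalling software does: the peak heights are hypothetical, and the 0.5 cutoff for the secondary-to-primary ratio is an illustrative choice. The two-base codes are the standard IUPAC ambiguity codes.

```python
# Standard IUPAC ambiguity codes for two-base mixtures.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_position(heights, min_ratio=0.5):
    """Call one position from a dict mapping each base to its peak height.

    Returns an IUPAC two-base code when the second-tallest peak is at least
    min_ratio of the tallest one, otherwise the base with the tallest peak.
    """
    ranked = sorted(heights.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height > 0 and second_height / top_height >= min_ratio:
        return IUPAC[frozenset((top_base, second_base))]
    return top_base


# Hypothetical heights: a T/C double peak, then a clean T peak.
print(call_position({"A": 10, "C": 900, "G": 5, "T": 1000}))
print(call_position({"A": 10, "C": 50, "G": 5, "T": 1000}))
```

The first position comes back as Y (C or T) and the second as a plain T.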

Dye Blob

See those two giant peaks at around 80 and 120? That’s unbound dye that managed to seep

through the purification column. It happens sometimes. The software is designed to read

through dye blobs without affecting basecalling, but we typically will take your purified reaction

and run it through the purification column again the next day, and the blobs will go away.

Now, go and bear fruit.

While this guide doesn’t cover every possible problem, it does cover 99.9% of the difficulties

you’ll see. And don’t forget, we’re here to help, so come by and have a chat if you have questions.