Simulation Study on the Methods for Mapping Quantitative Trait Loci in Inbred Line Crosses A Dissertation Submitted in Partial Fulfilment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in Genetics and Plant Breeding by SHENGCHU WANG Zhejiang University Hangzhou, Zhejiang, China 2000

statgen.ncsu.edu/zeng/Wang-Shengchu-Thesis.pdf

Simulation Study on the Methods for Mapping Quantitative Trait Loci

in Inbred Line Crosses

A Dissertation Submitted in Partial Fulfilment of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY in

Genetics and Plant Breeding

by

SHENGCHU WANG

Zhejiang University

Hangzhou, Zhejiang, China 2000

A Ph.D. DISSERTATION

Simulation Study of the Methods for Mapping

Quantitative Trait Loci in Inbred Line Crosses

By

Shengchu Wang

Major: Genetics and Plant Breeding

Supervisors: Dr. Jun Zhu and Dr. Zhao-Bang Zeng

Zhejiang University

Hangzhou, Zhejiang China

2000

DEDICATION

To My Wife, Xiu-Juan Rong

And Daughter, Min-Xue Wang

Acknowledgments

I would like to express my special thanks to my advisor, Dr. Jun Zhu, for his guidance, encouragement and support throughout my doctoral study and dissertation research. The experience of studying with Dr. Zhu was beneficial and unforgettable.

I would like to express my sincere thanks to Dr. Zhao-Bang Zeng for supporting me financially to do part of my dissertation research in the US, and for his generous help with my research and my life during my stay at NCSU. Thanks also to Dr. Bruce Weir for providing the host lab and for good advice on my research. I would like to express my gratitude to my wife and daughter for their support and patience.

I am grateful to Dr. Xin-Fu Yan, Dr. Yue-Fu Liu, Dr. Rong-Ling Wu, Hai-Ming Xu, Ci-Xin He, and everyone who helped me during my dissertation research. I also wish to thank my colleagues at the computer centre, Zhejiang University, for their support of my doctoral study and dissertation research.

Abstract

With the rapid advance of molecular genetics, it is now much easier to obtain well-distributed genetic markers in almost every organism. As a result, various statistical methods have been developed to detect or map quantitative trait loci (QTLs) using genetic marker information, and this has become a major direction of quantitative genetics. In this dissertation, the principles and models of various QTL mapping methods are summarized. These methods include single marker analysis, interval mapping (IM), composite interval mapping (CIM), mixed-model-based composite interval mapping (MCIM), and multiple interval mapping (MIM).

Large-scale simulation studies were used to explore and compare the various QTL mapping methods. The simulations indicated that although single marker analysis can detect QTLs, it cannot locate their positions or estimate their effects.

Simulations were also conducted to study and compare three methods of QTL mapping (IM, CIM, and MCIM) under the simple additive situation. By analysing the LR profile, the power of QTL detection and the probability of false QTL detection were calculated for the three methods under various situations. Estimates of QTL effects and positions, together with their 95% experimental confidence intervals (ECI), were also obtained for the detected QTLs. The simulation results should be useful to those who apply these three methods in QTL mapping practice. They could serve as one basis for choosing among the available QTL mapping methods for a particular experimental design, and they also provide information that helps in interpreting QTL mapping results.

In real QTL mapping experiments, however, more complicated situations such as QTL by environment (QE) interaction and QTL epistasis generally exist. For the IM and CIM methods, the simulation studies implied that unbiased estimates of QTL main effects can be obtained by analysing the data from all environments together. However, it is difficult to estimate the QE interaction effects, even when QTL mapping is done separately on the data from each environment. The MCIM method can put all QTL main effects and QE interaction effects into one mixed linear model and, as the simulation study indicated, obtains unbiased estimates of both the main and the QE interaction effects.

The MCIM method can also use the mixed linear model to map QTLs with marginal and epistatic effects. The simulation study indicated that MCIM can obtain unbiased estimates of QTL marginal and epistatic effects at the same time. Although IM and CIM can give unbiased estimates of marginal QTL effects when epistatic effects exist, the variance of those estimates increases considerably. In addition, the simulation study indicated that the detection power for QTLs decreases and the probability of false QTL detection increases noticeably, especially for the CIM method.

MIM is a multiple-QTL-oriented method that can also analyse QTL epistatic effects. However, a crucial part of the MIM method is the criterion, or stopping rule, for model selection. We propose a set of parameters for measuring the fit between the selected model and the real model and, using simulation, present an experimental criterion for model selection in the framework of QTL mapping. The criterion is a modification of BIC that takes into account relevant factors such as heritability, marker density, sample size, and chromosome number. The experimental criterion worked well in the simulated cases.

A modified version of the QTL Cartographer software has been developed, called Windows QTL Cartographer. Unlike the original QTL Cartographer, Windows QTL Cartographer has a user-friendly interface and powerful graphical presentation of the mapping results. It has many users and is available on the Internet (http://statgen.ncsu.edu/qtlcart/WQTLCart.htm).

Key words: Computer simulation; QTL mapping methods; Quantitative trait loci; Model selection; BIC criterion

TABLE OF CONTENTS

1. INTRODUCTION
  1-1 HISTORY OF THE QTL MAPPING WORK
  1-2 MOLECULAR MARKERS
  1-3 EXPERIMENTAL DESIGN
  1-4 MODELS AND SOFTWARE
  1-5 SIMULATION VS. REAL DATA
  1-6 MAP FUNCTIONS AND MARKER ANALYSIS
    1. Map Functions
    2. Marker Order Analysis
    3. Marker Segregation Analysis
  1-7 PURPOSE OF THIS RESEARCH

2. REVIEW OF MAJOR QTL MAPPING METHODS
  2-1 ONE MARKER METHOD
    1. Statistic Bases for One Marker Method
    2. The t-test Method
    3. Likelihood Ratio Test Method
    4. Simple Regression Method
  2-2 INTERVAL MAPPING METHOD
    1. Conditional Probabilities of QTL Genotypes
    2. Genetic Model
    3. Maximum Likelihood Analysis
    4. Likelihood Ratio Test
  2-3 COMPOSITE INTERVAL MAPPING
    1. Properties of Multiple Regression Analysis
    2. Genetic Model
    3. Likelihood Analysis
    4. Hypothesis Test
    5. Marker Selection
  2-4 MIXED LINEAR MODEL APPROACH
    1. Genetic Model
    2. Likelihood Analysis
    3. Hypothesis Test
    4. A Model for GE Interaction
    5. A Model for QTL Epistasis
  2-5 MULTIPLE INTERVAL MAPPING

3. SIMULATION STUDIES
  3-1 SIMULATION MODEL AND DATA
    1. Genetic Model for Simulation
    2. Parameter Setting
    3. Simulation Procedure
    4. Format of the Simulation Data
  3-2 SINGLE MARKER ANALYSIS
  3-3 COMPARING DIFFERENT MAPPING METHOD
    1. Parameters Setting
    2. Estimation of QTL Effects
    3. Power and False Positive
    4. Positions and Effects of Detected QTLs
    5. The LR Profile
  3-4 CONSIDER THE COMPLICATED QTL MAPPING SITUATIONS
    1. Parameters Setting
    2. Performance of IM and CIM Methods
    3. Using MCIM Method

4. MODEL SELECTION AND CRITERIA
  4-1 MIM AND MODEL SELECTION
  4-2 MODEL EVALUATION STANDARD
  4-3 MODEL SELECTION STRATEGY AND CRITERIA
  4-4 PROCEDURE OF MODEL SELECTION
  4-5 SUMMARY OF CRITERIA FOR MODEL SELECTION
    1. Adjusted R2
    2. Mallow’s Cp (Mallows 1973)
    3. Mean Squared Error Prediction (Aitkin 1974, Miller 1990)
    4. BIC and Related Criteria
  4-6 SIMULATION STUDIES OF CRITERIA
    1. FW and BW Methods
    2. Criteria and the Various Parameters
    3. Experimental Criteria

5. CONCLUSIONS AND DISCUSSION
  5.1 CONCLUSION
  5.2 THRESHOLD AND CRITERIA
  5.3 SOFTWARE DESIGN

REFERENCE

1. Introduction

1-1 History of the QTL mapping work

It is widely held that the rediscovery of Mendelian genetics in 1900 marked the beginning of modern genetics. The demonstration of the inheritance of discrete characters, such as purple vs. white flowers or smooth vs. wrinkled seeds, made it clear that traits are controlled by genetic factors, or genes, which are inherited from generation to generation. Since then, great efforts have been made to understand how genes affect discrete characters, or qualitative traits, and especially how genes are transmitted from parents to their offspring.

However, most economically and biologically important traits are not qualitative but quantitative in nature. Here, quantitative means that the trait’s values cannot be divided into a few discrete categories; instead, they are distributed continuously over a range in a population. Examples of quantitative traits are crop yield, plant height, resistance to diseases, weight gain in mice, and egg or milk production in animals. Because of the complex nature of quantitative inheritance, progress in quantitative genetics has lagged far behind that in Mendelian genetics. The traditional way to study quantitative traits is to partition the phenotypic variance into various genetic and non-genetic variance components.

$V_P = V_G + V_e = V_A + V_D + V_I + V_e$

Here the phenotypic variance $V_P$ is partitioned into two components: a genetic part $V_G$ and an environmental and residual part $V_e$. The genetic variance can be further partitioned into additive ($V_A$), dominance ($V_D$) and epistatic ($V_I$) variances. $V_G$ can also be partitioned into other variance components, depending on the application. For example:

$V_G = V_A + V_D + V_L + V_M$

where $V_L$ is the sex-linkage component and $V_M$ is the maternal variance component (Zhu and Weir 1996).

These variance components can be estimated under special breeding designs (Cockerham 1961; Eberhart et al. 1966; Falconer 1996; Zhu 1998). The estimates allow us to evaluate the relative importance of the various determinants of the phenotypic variance. The ratio $V_G/V_P$ is called the heritability in the broad sense, and $V_A/V_P$ is called the heritability in the narrow sense, or simply the heritability ($h^2$). Heritability measures the degree to which genetic effects, transmitted from parents to their offspring, contribute to phenotypic deviation, and it is useful in predicting the response to selection.
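
As a worked illustration of the two ratios above, the following minimal Python sketch computes the broad-sense and narrow-sense heritability from a set of variance components. The function name and all numerical values are hypothetical and chosen only for illustration. The last step uses the standard breeder's equation (response = heritability × selection differential), which is not stated in the text and is included here as an assumption about how heritability is used to predict the response to selection.

```python
def heritabilities(v_a, v_d, v_i, v_e):
    """Return (broad-sense, narrow-sense) heritability from variance components."""
    v_g = v_a + v_d + v_i   # genetic variance: V_G = V_A + V_D + V_I
    v_p = v_g + v_e         # phenotypic variance: V_P = V_G + V_e
    return v_g / v_p, v_a / v_p


# Hypothetical variance components, for illustration only.
broad, narrow = heritabilities(v_a=4.0, v_d=1.0, v_i=1.0, v_e=4.0)

# Assumed breeder's equation: response = h^2 * selection differential.
selection_differential = 10.0
response = narrow * selection_differential

print(f"broad-sense = {broad:.2f}, narrow-sense = {narrow:.2f}, response = {response:.1f}")
```

With these values the genetic variance is 6 and the phenotypic variance is 10, so the broad-sense heritability is 0.6 and the narrow-sense heritability is 0.4.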

The questions of how genes contribute to quantitative trait values and why the trait values are continuously distributed may be partially answered by the polygene theory (Johannsen 1909; Nilsson-Ehle 1909; East 1916). In this theory, a quantitative trait is controlled by many genes with small effects and is, at the same time, easily influenced by the environment. However, it is very difficult to dissect the individual genes controlling a quantitative trait by classical quantitative genetic means. Breeders therefore usually have no idea of the number, locations and effects of the individual genes involved in the inheritance of target quantitative traits (Comstock 1978). These genes are called quantitative trait loci (QTLs). Without information on the number, locations, and effects of the QTLs, it is impossible to manipulate them by genetic engineering and thereby improve an organism’s traits.

The history of QTL mapping can be traced back to the 1920s. Sax (1923) used morphological markers to demonstrate an association between seed weight and seed coat colour in beans. Thoday (1961) used multiple genetic markers to systematically map the individual polygenes that control a quantitative trait. He noted: “The main practical limitation of the technique seems to be the availability of suitable markers”. The numbers of morphological and protein markers are clearly very limited. Therefore, molecular genetic markers are the natural choice for detecting or mapping QTLs.

Nowadays, it is much easier to obtain well-distributed genetic markers in almost every organism because of the rapid advance of molecular genetic technology. Various statistical methods have been developed to detect or map QTLs using genetic marker information. Lander and Botstein (1989) proposed the interval mapping (IM) method, which uses two adjacent markers to bracket a region and tests for the existence of a QTL by performing a likelihood ratio test at every position in the region. The method has been shown to be more powerful and to require fewer progeny than one-marker methods. However, interval mapping has some drawbacks. Because it is a one-QTL model, the estimated positions of QTLs can be seriously biased when more than one QTL is located on the same chromosome (Knott and Haley 1992; Martinez and Curnow 1992).
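
One ingredient of interval mapping is the probability of each QTL genotype at a test position, given the genotypes of the two flanking markers. The sketch below computes that probability for a backcross population, in which every locus is either HH or HL. It is only an illustration under stated assumptions: the Haldane map function (no crossover interference) is assumed, the function names and the 10 cM interval are hypothetical, and the likelihood ratio test itself, which would use these probabilities as mixing weights at each position, is not implemented.

```python
import math


def haldane_r(d_morgans):
    """Recombination fraction for a map distance in Morgans (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def _transmit(a, b, r):
    """P(allele b at one locus | allele a at a linked locus) on an F1 gamete."""
    return 1.0 - r if a == b else r


def qtl_prob_backcross(m_left, m_right, d_left, d_total):
    """P(QTL genotype is HH | flanking markers) in a backcross to the H parent.

    m_left, m_right: 1 if the marker genotype is HH, 0 if it is HL.
    d_left: distance (Morgans) from the left marker to the test position.
    d_total: distance (Morgans) between the two flanking markers (must be > 0).
    """
    r1 = haldane_r(d_left)
    r2 = haldane_r(d_total - d_left)
    # Joint weight of (left marker, QTL, right marker) for QTL = HH and QTL = HL.
    w_hh = _transmit(m_left, 1, r1) * _transmit(1, m_right, r2)
    w_hl = _transmit(m_left, 0, r1) * _transmit(0, m_right, r2)
    return w_hh / (w_hh + w_hl)


# Scan a hypothetical 10 cM interval in 2.5 cM steps for one marker pattern.
for step in range(5):
    d = 0.025 * step
    print(f"{100 * d:4.1f} cM: P(HH | left=HH, right=HL) = "
          f"{qtl_prob_backcross(1, 0, d, 0.10):.3f}")
```

For a recombinant marker pair (left HH, right HL), the probability falls from 1 at the left marker to 0 at the right marker and equals one half at the midpoint, which is why the position of a QTL inside the interval carries information that a single marker cannot provide.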

Several attempts were later made to solve this problem. Zeng (1993) proved an important property of multiple regression analysis in relation to QTL mapping: “If there is no epistasis, the partial regression coefficient of a trait on a marker depends only on those QTLs that are in the interval bracketed by the two neighbouring markers and is independent of QTLs located in other intervals”. Zeng (1994) proposed an improved method, composite interval mapping (CIM), which combines interval mapping with multiple regression analysis. Jansen (1993) proposed a similar strategy. Composite interval mapping has been shown to perform better than interval mapping when multiple QTLs are linked. More recently, an extended method called multiple interval mapping (MIM) was proposed (Kao, Zeng and Teasdale 1999). This method fits all QTLs in the model simultaneously and can analyse QTL epistasis and the associated statistical issues.

A new methodology based on the mixed linear model approach (MCIM) was also proposed for systematically mapping QTLs (Zhu 1998, 1999; Zhu and Weir 1998). The MCIM method performs very similarly to Zeng’s CIM method (see Chapter 3). However, unlike CIM, MCIM does not require selecting background control markers or setting the mapping window size. The MCIM method also has the advantage of being easy to extend to more complicated QTL mapping situations, such as QTL epistasis and QTL by environment interaction.

1-2 Molecular Markers

In the classical quantitative genetic approach, the units of analysis are genetic variances rather than the underlying genes themselves. However, individual QTLs can be dissected by using linked marker loci. This approach has long been recognized (Sax 1923; Rasmusson 1933; Thoday 1961), but until recently it was regarded as of minor importance because of the lack of sufficient genetic markers. Thanks to modern molecular biology, this situation has changed dramatically. The ability to detect genetic variation directly at the DNA level has resulted in an essentially endless supply of markers for any species of interest. Not surprisingly, there has been an explosion in the use of marker-based methods in quantitative genetics.

The first molecular markers used were allozymes, protein variants detected by

differences in migration on starch gels in an electric field. This class of markers has

been extensively applied to a variety of genetic problems (Tanksley and Rick 1980;

Delourme and Eber 1992; Baes and Van Cutsem 1993; Kindiger and Vierling 1994).

Allozymic variants have the advantage of being relatively inexpensive to score in

large numbers of individuals, but there is often insufficient protein variation for

high-resolution mapping. This is the reason why the rapid development of QTL

mapping did not start with the advent of allozymic markers.

As methods for evaluating variation directly at the DNA level became widely

available during the mid-1980s, DNA-based markers largely replaced allozymes in

mapping studies. DNA is the genetic material of organisms and genetic differences

between individuals will be reflected directly by the nucleotide sequences of DNA

molecules. There are effectively no limitations on either the genomic location or the

number of DNA markers.

A wide variety of techniques can be used to measure DNA variation. Direct sequencing of DNA provides the ultimate measure of genetic variation, but much quicker scoring of variation is sufficient for most purposes. These methods include Restriction Fragment Length Polymorphisms (RFLPs), the Polymerase Chain Reaction (PCR), Randomly Amplified Polymorphic DNAs (RAPDs) and microsatellite DNAs. More recently developed methods include Representational Difference Analysis (RDA) and Genomic Mismatch Scanning (GMS).

RFLPs are one of the simplest and most widely used types of DNA marker. The approach is to digest DNA with a variety of restriction enzymes, each of which cuts the DNA at a specific sequence, or restriction site. When the digested DNA is run on a gel under an electric current, the fragments separate according to size. DNA from different individuals can show variation in fragment length. If we attempted to score the entire genome for fragment lengths, the result would be a complete smear on the gel. Instead, individual bands are isolated from this smear by using labelled DNA probes that are base-pair complementary to particular regions of the genome. Each RFLP probe generally scores a single marker locus, and the marker alleles are codominant, as heterozygotes and homozygotes can be distinguished. RFLP markers were first used in the construction of the human genetic map (Botstein et al. 1980; Donis-Keller et al. 1987), and their use has since been extended to other species (Beckmann and Soller 1983, 1986a, 1986b; Soller and Beckmann 1988).
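
The idea behind RFLP scoring can be sketched in a few lines of Python. The example below is a toy illustration, not a laboratory protocol: the two allele sequences are invented, only one strand is modelled, and the probe hybridization step is ignored. It uses the EcoRI recognition sequence GAATTC, which is cut between G and A; the function name `digest` and its parameters are hypothetical.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Return sorted fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the cut position inside the site (1 means G^AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


# Two hypothetical alleles; allele_b has lost the second site (C -> T).
allele_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 8
allele_b = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTT" + "T" * 8

bands_a = digest(allele_a)
bands_b = digest(allele_b)
# A heterozygote carries both alleles, so it shows the union of the bands.
bands_het = sorted(set(bands_a) | set(bands_b))

print("homozygote A :", bands_a)
print("homozygote B :", bands_b)
print("heterozygote :", bands_het)
```

The three genotypes give three different band patterns, which is what makes the marker codominant in this toy setting.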

PCR is a rather different molecular marker approach that uses short primers for DNA replication to delimit fragment sizes. A region flanked by primer-binding sequences in opposite orientation that are sufficiently close together allows the PCR reaction to replicate this region, generating an amplified fragment. If primer-binding sites are missing or are too far apart, the PCR reaction fails and no fragments are generated for that region. The RAPD method (Williams et al. 1990) follows a similar procedure, in which sequence polymorphisms are detected by using random short sequences as primers. The advantage is that a single primer can reveal several loci at once, each corresponding to a different region of the genome with appropriate primer sites. RAPDs also require smaller amounts of DNA. However, RAPD markers are dominant, so the marker genotype can be ambiguous. Ragot and Hoisington (1993) concluded that RAPDs are suitable for modest numbers of individuals, while RFLPs are better for larger studies.

Microsatellite DNAs, short arrays of simple repeated sequences, tend to be very highly polymorphic. Since array length is scored, microsatellites are codominant, as heterozygotes show two different lengths and hence can be distinguished from homozygotes. This kind of marker is especially suitable for outbred populations, because mapping in such populations is most efficient with marker loci that have a large number of alleles.

RDA and GMS are two recently developed advanced methods. Both methods examine the entire genome, allowing one to isolate only those sequences that are shared by two populations (GMS) or those that differ between populations (RDA). Good use of these methods will very likely provide powerful approaches for the isolation of QTLs (Lander 1993; Aldhous 1994). Besides the commonly used markers above, other categories of markers can also be very useful in some cases.

The linear arrangement of markers along the chromosomes, or genome, of a species is called a marker linkage map. Map information is very important for various kinds of QTL mapping research. Saturated marker maps, in which markers cover the whole genome at reasonable spacing, have been published for many organisms (Halward et al. 1993; Xu et al. 1994; Causse et al. 1994; Viruel et al. 1995; Hallden et al. 1996). With saturated maps of this kind, many areas of research have become more likely to succeed. These include studies of the evolutionary process of organisms through comparative mapping (Lagercrantz et al. 1996; Simon et al. 1997), marker-assisted selection to improve breeding efficiency (Lee 1995; Hamalainen et al. 1997) and marker-based cloning (Xu 1994).

It is necessary to distinguish between physical maps and genetic maps. The set of hereditary material transmitted from parent to offspring is known as the genome, and it consists of molecules of DNA (deoxyribonucleic acid) arranged in chromosomes. The DNA itself is characterized by its nucleotide sequence, that is, the sequence of the bases A, C, G and T. A physical map is an ordering of features of interest along the chromosome in which the metric is the number of base pairs between features. This is the level of detail needed for molecular studies, and several techniques are available for the physical mapping of discrete genetic markers or traits. In this dissertation, however, genetic maps are the main concern; in a genetic map, distances depend on the level of recombination expected between two points. An individual receives one copy of each heritable unit (allele) from each parent at each location (locus) of the genome. The combination of units (haplotype) at different locations (loci) that the individual transmits to the next generation need not be one of the parental sets. Recombination may have taken place during meiosis, the process that produces eggs or sperm. That is, through crossing-over events, the alleles in the haploid egg or sperm of a diploid may come from either of the two parental chromosomes. Although there is generally a monotonic relation between physical and recombination distance, the relation is not a simple one.

1-3 Experimental Design

Crosses between completely inbred lines that differ in the trait of interest offer an ideal setting for detecting and mapping QTLs by marker-trait associations. The reason is that all F1 individuals are genetically identical and show complete linkage disequilibrium for genes differing between the inbred lines. A number of designs have been proposed to exploit these features. These designs produce various mapping populations, including backcross, intercross, doubled haploid and recombinant inbred line populations. Most inbred line cross designs involve crop plants, but they have also been applied to a number of animal species, especially mice (reviewed by Frankel 1995).

Here we call the two parental inbred lines P1 and P2; one is the low (L) line and the other is the high (H) line. The F1 individuals receive a copy of each chromosome from each of the two parental lines, so they are heterozygous wherever the parental lines differ. All F1 individuals are genetically identical and have the genotype HL at each locus. Almost all experimental designs start from the F1 generation.

In a backcross design, the F1 individuals are crossed to one of the two parental lines, for example, the high line. The backcross progeny, which may number from 100 to over 1000, receive one chromosome from the F1 and one from the high parental line. Thus, at each locus, they have genotype either HH or HL. As a result of crossing over during meiosis, the process that forms the gametes, the chromosome received from the F1 is a mosaic of the two parental chromosomes. At each locus, there is a one-half chance of receiving the allele from the high parental line and a one-half chance of receiving the allele from the low parental line. The chromosome received therefore alternates between stretches of L's and H's.

Another common experimental design used in plants is the intercross design. The F2 population is produced by selfing or sib mating F1 individuals. The F2 individuals receive two sets of chromosomes from the F1 generation, each of which is a combination of parental chromosomes. Thus, at each locus, the F2 individuals have the genotype HH, HL or LL. The F2 population provides the most genetic information among the different types of mapping populations (Lander et al. 1987) and is relatively easy to obtain.

A doubled haploid (DH) population is composed of many DH lines, which are usually developed from the pollen of an F1 plant through anther culture and chromosome doubling. The individuals of a DH line are homozygous, with genotype HH or LL at each locus along the chromosome. DH populations are also called permanent populations because no segregation occurs in later generations. The advantage of the DH population is that the marker data can be used repeatedly in different locations and years under various experimental designs. However, the rate at which pollen grains are successfully turned into DH plants may vary with the genotype of the pollen, which causes segregation distortion and false linkage between some marker loci.

A recombinant inbred line (RIL) population is constructed by selfing or sib mating individuals for many generations, starting from the F2, by the single seed descent approach until almost all of the segregating loci become homozygous. Some RIL populations have been developed in rice, maize, barley and other crops in recent years (Burr et al. 1988, Reiter et al. 1992, Li et al. 1995). The advantage of the RIL population is that the genetic distances are enlarged compared to those obtained from F2 or BC populations, because many generations of selfing or sib mating increase the chance of recombination. Therefore, it may be useful for increasing the precision of QTL mapping. However, a limited number of generations of selfing or sib mating cannot make all individuals in a RIL population homozygous at all segregating loci, which decreases the efficiency of QTL mapping to some extent.

Different experimental design populations are used for different QTL mapping studies. In this dissertation, the B1 and DH populations are used as the chief examples because of their simplicity: at each locus in the genome, the progeny of a B1 or DH population have only two possible genotypes. However, the principles and results obtained here are easy to extend to other experimental design populations.

1-4 Models and Software

The QTL information (number, positions, effects, etc.) of an experimental population is unobservable. Through the experiment, one can observe only the trait phenotype and the marker information of each individual. The basis of QTL mapping is the idea that genetic markers that tend to be transmitted together with specific values of the trait are likely to be close to a gene affecting that trait. Therefore, genetic and statistical models are very important for describing the data and extracting the QTL information from them.

Genetic models describe the genetic behavior of the organism, such as recombination events and additive, dominance, or epistatic effects. For more than two markers on a chromosome, the simplifying assumption is that recombination between any two of them is independent of the other recombination events. This assumption is called no interference, and under it the occurrence of crossovers between DNA strands can be treated as a Poisson process. Therefore, Haldane's mapping function (Haldane 1919) can be used to describe the relationship between the recombination fraction r and the genetic distance x.

Statistical models are the means of obtaining the QTL information from the experimental data through association analysis and statistical calculation. Without an appropriate statistical model, there is no way to retrieve the QTL information from the experimental data, which include the quantitative phenotypes and the molecular markers. The statistical model is therefore critical for mapping QTLs, and a large number of new models have been proposed since the 1980s (Weller 1986, Lander and Botstein 1989, Haley and Knott 1992, Jansen 1992, 1993, Zeng 1993, 1994, Zhu 1998, Kao 1999, etc.).

These statistical models (methods) can be classified by the number of markers used or by the techniques applied (Liu 1997, Hoeschele et al. 1991). Classified by marker number, there are “single marker methods”, “flanking marker methods” and “multiple marker methods”. Classified by technique, there are “least squares methods”, “regression methods”, “maximum likelihood methods”, “mixed linear model approaches”, etc. In summary, these methods range from simple to complicated, from detecting QTL-marker association to locating QTL positions and estimating their effects, and from low resolution and power to high resolution and power. These methods are discussed in more detail in the later chapters.

Statistical problems can be solved with a calculator when the data set is not very large and the method is not too complicated. In practice, however, data sets are usually analyzed with computer programs. Several commercial software packages currently exist for general-purpose statistical analysis, including SAS, SPSS, SPLUS, and STATISTICA. It is possible to use this kind of software for QTL mapping analysis (Haley and Knott 1992). However, the methods for QTL mapping are usually complicated and not standardized, so mapping QTLs with general-purpose packages is usually inefficient and sometimes impossible. Therefore, many computer programs based on specific statistical methods have been developed for QTL mapping (Lander and Botstein 1989, Basten 1994, Wang 1999).

Mapmaker/QTL (Lander et al. 1987), which is based on the classical interval mapping principles, is one of the popular QTL mapping programs. It has versions for PC, Macintosh, and UNIX systems and uses a command-driven user interface. That is, a series of commands must be executed for the different stages, such as data input, the various mapping functions, and output of the results.

QTL Cartographer (Basten et al. 1994) is another popular QTL mapping program, developed according to Zeng's composite interval mapping method (Zeng 1994). The software has versions for PC and UNIX. However, the original software uses several commands to carry out the mapping tasks, which is sometimes confusing. We have developed a Windows version of QTL Cartographer that has a user-friendly interface and presents the results graphically. We expect the new version of QTL Cartographer to be much easier to use; the software is described in more detail later.

Other software is also available for QTL mapping, such as QTLSTAT (Liu and Knapp 1992), PGRI (Lu and Liu 1995), MAPQTL (Van Ooijen and Maliepaard 1996) and Map Manager QTL (Manly et al. 1996). These programs are not as popular as Mapmaker/QTL and QTL Cartographer. However, we believe that QTL mapping software based on new methods will gradually be accepted by genetic researchers over time. An advanced statistical method and a good user interface should be the most important factors for this kind of software.

1-5 Simulation vs. Real Data

A statistical model is used to describe a real biological or genetic system. Because such a system is so complicated and some factors are unknown, it is impossible to include all the factors (parameters) in a model. It is therefore reasonable that several statistical models exist for QTL mapping research. Some of them are quite complicated and others may be very simple. The properties of an estimator for a statistical model can be obtained analytically if the distribution of the estimator is known and well characterized. However, for most models in QTL mapping, the properties of the estimators are too complicated to obtain analytically. Therefore, computer simulation is necessary for obtaining these properties and for checking the performance of the models and methods. A model's performance cannot be examined with real (experimental) data, because the true parameters are unknown. The advantage of computer-simulated data is that the true parameters are known and can be compared with the estimates from the model.

The data for QTL mapping have two components: the map information and the cross information. The map information data set contains the positions and order of the markers on each chromosome or linkage group of an experimental organism. Figure 1-1 shows the estimated genetic map of the X chromosome of the mouse, and Table 1-1 gives the map data in QTL Cartographer format.

Figure 1-1. Marker information for the X chromosome in the mouse data. The numbers are the distances in cM between adjacent markers and the labels are the marker names.

Markers (in the map order of Table 1-1): Tpm3-rs9, DXMit3, Hmg1-rs14, DXMit97, DXMit16, Hmg14-rs6, DXMit109, DXNds1, DXMit48, Rp118-rs17, Rps17-rs11, DXMit57, Hmg1-rs13

Distances between adjacent markers (cM): 14.3, 12.2, 4.1, 2.5, 4.1, 6.7, 6.9, 6.9, 4.1, 1.2, 1.5, 8.3

Table 1-1. Map data in QTL Cartographer format.

No.¹   Label²        Interval³   Position⁴
1      Tmp3-rs9      14.3        0.0
2      DXMit3        12.2        14.3
3      Hmg1-rs14     4.1         26.5
4      DXMit97       2.5         30.6
5      DXMit16       4.1         33.1
6      Hmg14-rs6     6.7         37.2
7      DXMit109      6.9         43.9
8      DXNds1        6.9         50.8
9      DXMit48       4.1         57.7
10     Rp118-rs17    1.2         61.8
11     Rps17-rs11    1.5         63.0
12     DXMit57       8.3         64.5
13     Hmg1-rs13     0.0         72.8

¹Marker number. ²Marker name. ³Marker position (cM) in ‘interval’ format. ⁴Marker position (cM) in ‘position’ format.

The cross information includes the trait values and the genotype at each marker position for the individuals of an experimental population. Table 1-2 shows part of the cross information of the mouse data set, which comes from a backcross population.

In a simulation study, we can set the map information for a population and produce (sample) the cross information of each individual in the population according to various parameters such as the QTL number, positions, and distribution.

Table 1-2. Cross data of the first 6 individuals for the X chromosome of the mouse.

               Markers on the X chromosome of the mouse³
Ind¹   BW²     1   2   3   4   5   6   7   8   9   10  11  12  13
1      50.0    1   1   1   1   1   1   1   1   1   1   1   1   1
2      54.0    1   1   1   1   1   1   1   1   1   1   1   1   0
3      49.0    0   1   1   1   1   1   1   1   1   1   1   1   1
4      41.0    0   0   0   0   0   0   0   0   0   0   0   0   0
5      36.0    1   1   1   1   1   1   1   1   1   1   1   1   1
6      48.0    0   0   0   0   0   0   0   0   0   0   0   0   0
⋮      ⋮       ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮

¹Individual number. ²One of the traits: body weight. ³Marker genotype at each marker position: 1 = AA and 0 = Aa.

1-6 Map Functions and Marker Analysis

1. Map Functions

Obtaining the marker information, such as position and order along the chromosomes, is very important for the QTL mapping study of an organism. The state of a specific genetic marker is called the marker genotype. There are two marker genotypes in a backcross (or DH) population. At marker M we can use 1 to represent the MM genotype and 0 to represent Mm (or −1 to represent mm). Individuals sharing the same parents may have different genotypes at the same genetic markers. These differences provide the variation needed to estimate statistically the relationships between genetic markers and thereby resolve their linear order along the chromosomes of the organism.

Recombination, or crossing over, during prophase I of meiosis is the reason why individuals with the same parents may have different marker genotypes. That is, during the production of gametes, an exchange of material between pairs of chromosomes may occur. The resulting variation, or recombinants, can be detected with laboratory techniques and recorded as the marker genotype of each individual. Several facts about marker genotypes are worth noting:

- The closer two markers are, the less likely a recombination event is to occur between them.

- Markers that reside on different chromosomes are unlinked.

- Two markers that never experience a recombination event between them are called completely linked. They travel together during meiosis.

- If an even number of crossing over events occurs between two genetic markers, the recombination is undetectable.

The number of crossovers (k) in an interval defined by two genetic markers is assumed to have a Poisson distribution with mean θ. A recombinant is observed only when k is odd, so that:

$$r = \Pr(\text{recombination}) = \sum_{k\ \text{odd}} \frac{e^{-\theta}\theta^{k}}{k!} = e^{-\theta}\left(\frac{\theta}{1!} + \frac{\theta^{3}}{3!} + \cdots\right) = e^{-\theta}\,\frac{e^{\theta} - e^{-\theta}}{2} = \frac{1}{2}\left(1 - e^{-2\theta}\right) \qquad (1\text{-}1)$$

where θ is the number of map units between the two markers, measured in Morgans (M); one M equals 100 cM (centiMorgans). Solving the above equation for θ gives Haldane's map function:

$$\theta = -\frac{1}{2}\ln(1 - 2r) \qquad (1\text{-}2)$$

If r equals 0, θ is also 0, which is the case of complete linkage. As r approaches 0.5, θ goes to ∞, which means that the markers are unlinked. This happens either because the markers reside on different chromosomes or because they are on the same chromosome but far apart.

Table 1-3. Relationship between recombination frequency and map distance (M).

Recom.    0.0100  0.0500  0.1000  0.1500  0.2000  0.3000  0.4000  0.4900  0.4950
Haldane   0.0101  0.0527  0.1116  0.1783  0.2554  0.4581  0.8047  1.9560  2.3026
Kosambi   0.0100  0.0502  0.1014  0.1548  0.2118  0.3466  0.5493  1.1488  1.3233

If interference is taken into account, the Kosambi map function should be used:

$$\theta = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r} \qquad (1\text{-}3)$$

Table 1-3 gives the relationship between r and the map distance (in Morgans) under the two map functions. For the same recombination frequency, the Kosambi function gives a smaller map distance than the Haldane function, and the difference grows as the two markers become further apart. For very small recombination frequencies, however, both map functions give values close to the recombination frequency itself.
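The two map functions are easy to evaluate numerically. The following Python sketch implements formulas (1-1) to (1-3), with distances in Morgans. The function names are illustrative only and are not taken from any QTL mapping package. Run as a script, it prints the map distances for the recombination frequencies listed in Table 1-3.

```python
import math


def haldane_distance(r):
    """Map distance in Morgans for recombination fraction r (formula 1-2)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * r)


def haldane_recombination(theta):
    """Recombination fraction for map distance theta in Morgans (formula 1-1)."""
    return 0.5 * (1.0 - math.exp(-2.0 * theta))


def kosambi_distance(r):
    """Map distance in Morgans for recombination fraction r (formula 1-3)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Recombination frequencies of Table 1-3.
    for r in (0.01, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.49, 0.495):
        print(f"{r:.4f}  {haldane_distance(r):.4f}  {kosambi_distance(r):.4f}")
```

Multiplying the returned distance by 100 converts it from Morgans to cM.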

2. Marker Order Analysis

It is necessary to estimate the probability of recombination between each pair of genetic markers. Recombination that occurs in the F1 gametes is detectable in the backcross (B1) generation. Assume we have two markers M and N, each having two alleles, M1, M2 and N1, N2. The possible genotypes at the two markers in a B1 population are M1/M1 or M1/M2 and N1/N1 or N1/N2. If an offspring's genotype differs from the parental genotype at the markers, a recombination event is observed. From Table 1-4, the total number of recombinant events is n2 + n3. Therefore, the estimate of the recombination frequency between markers M and N is (n2 + n3)/(n1 + n2 + n3 + n4). The maximum likelihood method can also be used to solve this problem and leads to the same estimate.

The likelihood function describing this situation is

$$L(r) = C\,r^{\,n_2 + n_3}(1 - r)^{\,n_1 + n_4}$$

Taking the natural logarithm gives

$$\ln L(r) = \ln C + (n_2 + n_3)\ln r + (n_1 + n_4)\ln(1 - r)$$

Setting the derivative with respect to r to 0 and solving the equation for r gives

$$\frac{\partial \ln L(r)}{\partial r} = \frac{n_2 + n_3}{r} - \frac{n_1 + n_4}{1 - r} = 0 \quad\text{and}\quad \hat{r} = \frac{n_2 + n_3}{n_1 + n_2 + n_3 + n_4}$$

Table 1-4. The possible genotypes for markers M and N in a B1 population.

Marker genotypes   N1/N1   N1/N2
M1/M1              n1      n2
M1/M2              n3      n4
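As a small check of the estimator, the sketch below computes the closed-form estimate and compares it with a direct grid search over the log-likelihood. The counts are tallied from the genotypes of markers 1 and 5 in Table 1-5 (n1 = 14, n2 = 4, n3 = 0, n4 = 12). The function names and the grid resolution are illustrative choices.

```python
import math


def recombination_mle(n1, n2, n3, n4):
    """Closed-form maximum likelihood estimate of r for a backcross."""
    total = n1 + n2 + n3 + n4
    if total == 0:
        raise ValueError("no individuals scored")
    return (n2 + n3) / total


def log_likelihood(r, n1, n2, n3, n4):
    """ln L(r) without the constant ln C; requires 0 < r < 1."""
    return (n2 + n3) * math.log(r) + (n1 + n4) * math.log(1.0 - r)


# Counts for markers 1 (M) and 5 (N), tallied from Table 1-5.
counts = (14, 4, 0, 12)
r_hat = recombination_mle(*counts)

# Numerical check: the grid point with the highest log-likelihood
# should lie next to the closed-form estimate.
grid = [i / 1000 for i in range(1, 1000)]
r_grid = max(grid, key=lambda r: log_likelihood(r, *counts))
print(f"closed form: {r_hat:.3f}, grid search: {r_grid:.3f}")
```

The closed-form value, 4/30, agrees with the frequency 0.13 reported for markers 1 and 5 in Table 1-6.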

The above formula makes it easy to calculate the pairwise recombination frequency between each pair of markers. From these calculations we can determine the linkage groups. A linkage group is a group of markers in which each marker is linked (r < 0.5) to at least one other marker. If a marker is not linked to any marker in a linkage group, it does not belong to that group and most likely belongs to some other linkage group. In theory, the number of linkage groups should equal the number of chromosomes. However, the number of linkage groups is sometimes greater than the number of chromosomes because of sampling variance and limited sample size. In other words, some of the linkages between markers are not detected by the experiment. In this case, it is necessary to increase the sample size or to do more experiments.
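The grouping rule can be sketched as a search for connected sets of markers. The Python example below is a minimal illustration under stated assumptions: the marker names and recombination frequencies are hypothetical, and the linkage threshold of 0.35 is an arbitrary cut-off chosen for the example, since an estimated r is rarely exactly 0.5 even for unlinked markers.

```python
def linkage_groups(markers, rec_freq, threshold):
    """Group markers so that each marker is linked (estimated r below the
    threshold) to at least one other marker of its group."""
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent links to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), r in rec_freq.items():
        if r < threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())


# Hypothetical pairwise estimates for five markers A-E (illustration only).
rec_freq = {
    ("A", "B"): 0.10, ("A", "C"): 0.28, ("B", "C"): 0.20,
    ("D", "E"): 0.05,
    ("A", "D"): 0.48, ("A", "E"): 0.51, ("B", "D"): 0.47,
    ("B", "E"): 0.49, ("C", "D"): 0.52, ("C", "E"): 0.46,
}
print(linkage_groups("ABCDE", rec_freq, threshold=0.35))
```

A looser threshold merges groups more readily, so the number of groups found depends on the cut-off as well as on the sample size.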

Figure 1-2. A linkage group structure for the simulation study. The numbers above the markers are the distances between adjacent markers in cM and the numbers below identify the markers.

Marker order: 2, 3, 5, 1, 4
Distances between adjacent markers (cM): 20.3, 17.6, 7.8, 21.0

Table 1-5. Simulation data set of marker genotypes for a backcross population.

              Markers                            Markers
Individual    1   2   3   4   5     Individual   1   2   3   4   5
1             AA  AA  AA  Aa  AA    16           AA  AA  Aa  AA  AA
2             AA  Aa  Aa  AA  AA    17           Aa  Aa  Aa  Aa  Aa
3             Aa  Aa  Aa  Aa  Aa    18           Aa  Aa  Aa  Aa  Aa
4             AA  Aa  Aa  AA  Aa    19           Aa  Aa  Aa  AA  Aa
5             AA  AA  AA  AA  AA    20           Aa  Aa  Aa  Aa  Aa
6             AA  Aa  AA  AA  AA    21           AA  AA  AA  AA  AA
7             AA  AA  Aa  Aa  Aa    22           AA  AA  AA  AA  AA
8             AA  AA  AA  AA  AA    23           Aa  Aa  Aa  Aa  Aa
9             Aa  Aa  Aa  Aa  Aa    24           Aa  AA  Aa  AA  Aa
10            AA  AA  AA  Aa  AA    25           Aa  Aa  Aa  Aa  Aa
11            Aa  Aa  Aa  Aa  Aa    26           AA  AA  AA  AA  AA
12            AA  AA  AA  AA  Aa    27           AA  Aa  Aa  AA  Aa
13            AA  Aa  Aa  AA  AA    28           AA  AA  AA  AA  AA
14            Aa  Aa  Aa  Aa  Aa    29           Aa  AA  Aa  Aa  Aa
15            AA  AA  AA  AA  AA    30           AA  AA  AA  AA  AA

Table 1-5 is a simulated data set that contains the marker genotypes of a B1 population with 5 markers and 30 individuals, produced from the linkage structure shown in Figure 1-2. The Haldane map function was used. The numbers of recombination events between pairs of markers, which are the counts of changes from genotype AA to Aa or from genotype Aa to AA, are presented in Table 1-6. Table 1-6 also includes the recombination frequencies, which are the numbers of recombination events divided by the total number of individuals, 30.
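A data set of this kind can be generated with a few lines of code. The sketch below is one possible way to do it, assuming no interference, so that recombination events in adjacent intervals are independent and the Haldane function converts each distance into a recombination fraction. The function names and the random seed are hypothetical. This is not the program that produced Table 1-5, so the genotypes it prints will differ from those in the table.

```python
import math
import random


def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def simulate_backcross(distances_cm, n_individuals, rng):
    """Simulate marker genotypes (1 = AA, 0 = Aa) for one linkage group.

    distances_cm[i] is the distance between adjacent markers i and i + 1,
    listed in map order.
    """
    rec = [haldane_recombination(d) for d in distances_cm]
    population = []
    for _ in range(n_individuals):
        # The F1 gamete carries either parental allele with probability 1/2.
        genotype = [rng.randint(0, 1)]
        for r in rec:
            # Switch parental origin when a recombination occurs in the interval.
            previous = genotype[-1]
            genotype.append(1 - previous if rng.random() < r else previous)
        population.append(genotype)
    return population


# Linkage group of Figure 1-2 (map order 2-3-5-1-4).
distances = [20.3, 17.6, 7.8, 21.0]
rng = random.Random(2000)  # arbitrary seed, used only for repeatability
sample = simulate_backcross(distances, 30, rng)
for row in sample[:5]:
    print(" ".join("AA" if g == 1 else "Aa" for g in row))
```

The columns of the simulated data follow the map order of Figure 1-2, not the marker numbers, so they would need to be relabelled before comparison with Table 1-5.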

It is very important to know the order and positions of the markers along the linkage group or chromosome. We can estimate this information from the table of recombination frequencies (Table 1-6). The smallest value in Table 1-6 is 0.13, which is the recombination frequency both between markers 1 and 5 and between markers 3 and 5. Here we choose 3-5 as the starting point (1-5 could be chosen as well). We then look for the smallest value either on the marker 3 side (2-3 is 0.17) or on the marker 5 side (5-1 is 0.13), and the new order becomes 3-5-1. The next marker picked is 4 (1-4 is 0.17), and the new order is 3-5-1-4. The remaining marker 2 is closest to marker 3 (2-3 is 0.17), so the final order is 2-3-5-1-4.

After the marker order is obtained, it is easy to estimate the map distance between markers from the recombination frequencies and an appropriate map function. For example, the recombination frequency between markers 2 and 3 is 0.17, and the distance is 20.8 cM according to formula (1-2). The final result is shown in Figure 1-3.

Table 1-6. The count (frequencies) of recombination events.

Markers   1          2          3          4           5
1         0 (0.00)   7 (0.23)   6 (0.20)   5 (0.17)    4 (0.13)
2                    0 (0.00)   5 (0.17)   10 (0.33)   7 (0.23)
3                               0 (0.00)   9 (0.30)    4 (0.13)
4                                          0 (0.00)    7 (0.23)
5                                                      0 (0.00)
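The ordering procedure described in the text can be written as a short greedy search. The sketch below uses the recombination counts of Table 1-6. The function names and the tie-breaking rule are illustrative assumptions. The distances are computed from frequencies rounded to two decimals, as printed in Table 1-6; unrounded frequencies give slightly different distances.

```python
import math

N_INDIVIDUALS = 30
# Recombination counts from Table 1-6, keyed by (smaller, larger) marker number.
COUNTS = {
    (1, 2): 7, (1, 3): 6, (1, 4): 5, (1, 5): 4,
    (2, 3): 5, (2, 4): 10, (2, 5): 7,
    (3, 4): 9, (3, 5): 4,
    (4, 5): 7,
}


def count(a, b):
    """Recombination count between markers a and b (symmetric lookup)."""
    return COUNTS[(min(a, b), max(a, b))]


def greedy_order():
    """Start from a closest pair, then repeatedly attach the unplaced
    marker that is closest to either end of the current order."""
    order = list(min(COUNTS, key=COUNTS.get))
    remaining = {m for pair in COUNTS for m in pair} - set(order)
    while remaining:
        candidates = [(count(order[0], m), "left", m) for m in remaining]
        candidates += [(count(order[-1], m), "right", m) for m in remaining]
        _, side, marker = min(candidates)  # ties broken by side, then marker
        if side == "left":
            order.insert(0, marker)
        else:
            order.append(marker)
        remaining.remove(marker)
    # An order and its reverse describe the same map; report one orientation.
    return order if order[0] < order[-1] else order[::-1]


def haldane_cm(r):
    """Haldane map distance in cM for recombination fraction r (formula 1-2)."""
    return -50.0 * math.log(1.0 - 2.0 * r)


order = greedy_order()
freqs = [round(count(a, b) / N_INDIVIDUALS, 2) for a, b in zip(order, order[1:])]
distances = [round(haldane_cm(r), 1) for r in freqs]
print(order, freqs, distances)
```

Although this sketch starts from the pair 1-5 instead of 3-5, it arrives at the same order, 2-3-5-1-4, with adjacent distances of 20.8, 15.1, 15.1 and 20.8 cM.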

Figure 1-3. Estimated linkage group structure for the simulation data set.

Marker order: 2, 3, 5, 1, 4
Recombination frequencies between adjacent markers: 0.17, 0.13, 0.13, 0.17
Estimated distances between adjacent markers (cM): 20.8, 15.1, 15.1, 20.8

Comparing Figure 1-3 with Figure 1-2, the estimated marker order is correct but the distances between markers are not very accurate. This is quite reasonable considering such a small sample size (only 30 individuals). As the sample size increases, the estimates become more precise.

From this simple example, it seems quite easy to obtain the marker order and the estimates of the marker distances by counting the recombination events. However, as the number of markers increases, the problem of ordering a set of genetic markers becomes very difficult. This problem is equivalent to the famous “Travelling Salesman Problem”. One of the criteria for comparing two different orders is to minimize the Sum of Adjacent Recombination Fractions (SAR). For the above example, the SAR value for the final order 2-3-5-1-4 is 0.17 + 0.13 + 0.13 + 0.17 = 0.60. Another criterion is SAL, which stands for Sum of Adjacent Likelihood Functions.
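For five markers the SAR criterion can still be checked exhaustively. The sketch below enumerates every distinct order for the counts of Table 1-6 and picks the one with the smallest SAR. The names are illustrative, and exhaustive enumeration is used here only because the example is tiny.

```python
from itertools import permutations

N_INDIVIDUALS = 30
# Recombination counts from Table 1-6, keyed by (smaller, larger) marker number.
COUNTS = {
    (1, 2): 7, (1, 3): 6, (1, 4): 5, (1, 5): 4,
    (2, 3): 5, (2, 4): 10, (2, 5): 7,
    (3, 4): 9, (3, 5): 4,
    (4, 5): 7,
}


def sar(order):
    """Sum of adjacent recombination fractions for a marker order."""
    pairs = zip(order, order[1:])
    return sum(COUNTS[(min(a, b), max(a, b))] for a, b in pairs) / N_INDIVIDUALS


# An order and its reverse are equivalent, so keep one orientation of each.
orders = [p for p in permutations(range(1, 6)) if p[0] < p[-1]]
best = min(orders, key=sar)
print(len(orders), best, round(sar(best), 2))
```

There are 5!/2 = 60 distinct orders for five markers, and the minimum SAR of 0.60 is reached at 2-3-5-1-4. The count of distinct orders grows as n!/2 with the number of markers n, which is why exhaustive search quickly becomes impractical.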

The main problem in ordering the markers is not the criterion but the computation time. As the number of markers increases, the number of possible orders quickly becomes computationally unmanageable. Therefore, the only way to solve this problem is to find a good (not necessarily the best) order through some kind of search procedure. Several methods have been proposed since 1980, including Branch and Bound (Thompson 1984), Simulated Annealing (Weeks and Lange 1987), Seriation (Buetow and Chakravarti 1987a, 1987b), and Rapid Chain Delineation (Doerge 1993). A number of programs are available for ordering markers and estimating the distances between them; MAPMAKER (Lander et al. 1987) is one of them.

3. Marker Segregation Analysis

It is also important to carry out a Mendelian segregation test for each marker in order to detect segregation distortion. The expected segregation ratio is 1:1 for BC, DH, or RIL populations and 1:2:1 for the intercross population. In a backcross population, a cross between AA and Aa produces the zygotes AA and Aa with the same expected number, n/2. Table 1-7 shows the expected and observed numbers for the simulation data set of Table 1-5. A test statistic can be constructed using χ² under the null hypothesis p(AA) = p(Aa) = 0.5 (Mendelian segregation), as shown in formula (1-4). In this example, the number of individuals is n = 30, and n1 and n2 are the observed numbers of genotypes AA and Aa at each marker position.

$$\chi^{2} = \sum_{i=1}^{2} \frac{(\text{Obs.\#} - \text{Exp.\#})^{2}}{\text{Exp.\#}} = \frac{(n_1 - n_2)^{2}}{n} \sim \chi^{2}_{1} \qquad (1\text{-}4)$$

Rejecting H0 means that the deviation from Mendelian segregation is significant; this phenomenon is called segregation distortion. Segregation distortion can be caused by sampling variation. Sometimes, however, it has a genetic cause, such as different selection forces acting on different types of zygotes. Significant segregation distortion can bias the estimates of recombination frequency (distance) between markers. It can also reduce the power to identify QTLs and bias the estimates of QTL positions and effects.

Table 1-7. Marker segregation analysis for the simulation data set.

Markers                Marker 1      Marker 2      Marker 3      Marker 4      Marker 5
Genotypes              AA     Aa     AA     Aa     AA     Aa     AA     Aa     AA     Aa
Frequency under H0¹    ½      ½      ½      ½      ½      ½      ½      ½      ½      ½
Expected number        15     15     15     15     15     15     15     15     15     15
Observed number        18     12     15     15     12     18     17     13     14     16
χ² value               1.20          0.00          1.20          0.53          0.13
p-value                >0.250        >0.995        >0.250        >0.250        >0.500

¹H0: null hypothesis.
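The χ² values of Table 1-7 can be recomputed directly from the observed counts. The following sketch applies formula (1-4) to each marker. The p-value uses the identity P(χ² > x) = erfc(√(x/2)), which holds for one degree of freedom; the function name is illustrative.

```python
import math


def segregation_test(n_aa, n_ab):
    """Chi-square test of 1:1 segregation (formula 1-4).

    Returns (chi-square value, p-value) for one degree of freedom.
    """
    n = n_aa + n_ab
    chi2 = (n_aa - n_ab) ** 2 / n
    # Upper tail of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Observed numbers of AA and Aa for markers 1-5 (Table 1-7).
observed = {1: (18, 12), 2: (15, 15), 3: (12, 18), 4: (17, 13), 5: (14, 16)}
for marker, (n_aa, n_ab) in observed.items():
    chi2, p = segregation_test(n_aa, n_ab)
    print(f"marker {marker}: chi2 = {chi2:.2f}, p = {p:.3f}")
```

The exact p-values computed this way are consistent with the lower bounds listed in Table 1-7, and none of the five markers shows significant segregation distortion.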

1-7 Purpose of This Research

The purpose of QTL mapping is to identify and locate the QTLs along the chromosomes of a species by means of special experimental designs and genetic marker information. QTL information such as number, locations, and effects can help geneticists and breeders to improve the quality and quantity of plants or animals. QTL mapping methods, however, are founded on statistical principles, and it is important to understand these principles before using a particular QTL mapping method to analyze an experimental data set. Moreover, comparing different QTL mapping methods is useful for understanding how the various methods perform under different circumstances. This kind of study can help users to choose the appropriate QTL mapping method for their experimental requirements and provides a basis for interpreting the results of a QTL mapping analysis.

In this research, large-scale computer simulations were conducted to study and compare the performance of the major QTL mapping methods. These methods include the Interval Mapping method, the Composite Interval Mapping (CIM) method, and the mixed-model based CIM method. We also conducted a series of simulation studies to identify model selection criteria, which are a critical part of multiple QTL mapping methods. The computer software that accompanies a particular QTL mapping method is very important, because QTL mapping methods are usually too complicated to use without it. However, most existing QTL mapping software uses a command-driven interface and is usually not very convenient to use. We have developed QTL mapping software with a user-friendly interface and result visualization. The software is called “Windows QTL Cartographer” (Wang et al. 1999); it has been posted on the Internet and has many users.

2. Review of Major QTL Mapping Methods

2-1 One Marker Method

The one marker method is based on the simple idea that if there is an association between marker type and trait value, a QTL is likely to be close to that marker locus. The approach has been applied in many studies of QTLs in various organisms such as Drosophila (Thoday 1961), maize (Edwards et al. 1987) and tomato (Weller, Soller and Brody 1988).

Table 2-1. Trait mean and distribution for various populations.

Population   Genotype¹   Mean           Distribution
P1           MQ/MQ       µ1 = µ + a     N(µ1, σ²)
P2           mq/mq       µ2 = µ − a     N(µ2, σ²)
F1           MQ/mq       µ12 = µ + d    N(µ12, σ²)

¹M or m denotes the marker and Q or q denotes the QTL.

Table 2-2. Frequencies and mean effects for various marker-QTL genotypes in B1 population.

Genotype      MQ/MQ       MQ/Mq   MQ/mQ   MQ/mq
Frequency¹    (1 − r)/2   r/2     r/2     (1 − r)/2
Mean effect   µ + a       µ + d   µ + a   µ + d

¹r is the recombination frequency between marker and QTL.

1. Statistical Basis for One Marker Method

Suppose that two parental inbred lines differ so much in the quantitative trait that we are convinced there are QTLs responsible for the trait difference. Assume that the trait values of the two parental lines and the F1 population are normally distributed as shown in Table 2-1. The frequencies and mean effects of the marker-QTL genotypes in the B1 population (P1 × F1 = MQ/MQ × MQ/mq) are shown in Table 2-2. Although we cannot observe the QTL genotypes, the marker genotypes are observable, and the mean effects of the marker genotypes in the B1 population are shown in Table 2-3.

If only one QTL is linked to marker M, the difference between the means of the two marker types in the B1 population is given by formula (2-1). If epistatic effects are ignored, the mean difference when multiple QTLs are linked to marker M is given by formula (2-2).

Table 2-3. Mean effects for various marker genotypes in B1 population.

Marker type                QQ       Qq       Mean effect
MM          Frequency¹     1 − r    r
            Effect         µ + a    µ + d    µMM = (1 − r)(µ + a) + r(µ + d)
Mm          Frequency      r        1 − r
            Effect         µ + a    µ + d    µMm = r(µ + a) + (1 − r)(µ + d)

¹The frequency of the QTL genotypes; r is the recombination frequency between Q and M.

$$\mu_{MM} - \mu_{Mm} = [(1 - r)(\mu + a) + r(\mu + d)] - [r(\mu + a) + (1 - r)(\mu + d)] = (1 - 2r)(a - d) \qquad (2\text{-}1)$$

$$\mu_{MM} - \mu_{Mm} = \sum_{k=1}^{m} (1 - 2r_{ik})(a_k - d_k) \qquad (2\text{-}2)$$

2. The t-test Method

From formula (2-1), if the difference in means between the two marker genotypes is not zero, it can be inferred that $r \neq 0.5$, since it is known that $\delta = (a - d) \neq 0$. Therefore, we can use the t-test statistic (formula 2-3) to test for linkage between marker M and QTL Q.

Hypotheses: H0: $\mu_{MM} - \mu_{Mm} = 0$; H1: $\mu_{MM} - \mu_{Mm} \neq 0$

t-test statistic:

$$t = \frac{\mu_{MM} - \mu_{Mm}}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2) \tag{2-3}$$

Here $n_1$ and $n_2$ are the numbers of individuals belonging to the 'MM' and 'Mm' marker classes, respectively, the two means are estimated by the corresponding sample means, and the pooled variance is

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

where $s_1^2$ is the estimate of the variance for the 'MM' marker class individuals and $s_2^2$ is the estimate of the variance for the 'Mm' marker class individuals.
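The pooled t statistic of formula (2-3) can be sketched in a few lines of Python. This is only an illustration: the function name `marker_t_test` and the toy phenotypes are assumptions, not part of the original study.

```python
import math

def marker_t_test(y_mm, y_mh):
    """Pooled two-sample t statistic for the MM versus Mm marker classes."""
    n1, n2 = len(y_mm), len(y_mh)
    mean1 = sum(y_mm) / n1
    mean2 = sum(y_mh) / n2
    # Unbiased within-class variances s1^2 and s2^2
    s1 = sum((y - mean1) ** 2 for y in y_mm) / (n1 - 1)
    s2 = sum((y - mean2) ** 2 for y in y_mh) / (n2 - 1)
    # Pooled variance s_p^2
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    t = (mean1 - mean2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom

# Hypothetical phenotypes for three MM and three Mm individuals
t, df = marker_t_test([10.0, 12.0, 11.0], [8.0, 9.0, 7.0])
print(t, df)
```

The returned degrees of freedom, n1 + n2 − 2, are those of the reference t distribution in formula (2-3).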

3. Likelihood Ratio Test Method

For a normally distributed variable $Y \sim N(\mu, \sigma^2)$, the likelihood for the parameters is

$$L(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y-\mu)^2}{2\sigma^2}}.$$

The phenotypic distribution for the B1 population is a mixture of normal distributions:

$$Y \sim (1-r)\,N(\mu_1, \sigma^2) + r\,N(\mu_{12}, \sigma^2) \quad \text{for the MM marker genotype}$$

$$Y \sim r\,N(\mu_1, \sigma^2) + (1-r)\,N(\mu_{12}, \sigma^2) \quad \text{for the Mm marker genotype}$$

The likelihood function of any one marker for the backcross scenario is

$$L(\mu_1, \mu_{12}, \sigma^2, r;\ x_1, \ldots, x_n, y_1, \ldots, y_n) = \prod_{i=1}^{n_1}\left[(1-r)N(\mu_1, \sigma^2) + rN(\mu_{12}, \sigma^2)\right] \prod_{i=1}^{n_2}\left[rN(\mu_1, \sigma^2) + (1-r)N(\mu_{12}, \sigma^2)\right]$$

where the first product runs over the $n_1$ individuals of marker class MM, the second over the $n_2$ individuals of marker class Mm, and each normal density is evaluated at the phenotype of the individual. The hypothesis of no linkage can be tested with the likelihood ratio statistic in formula (2-4).

Hypotheses: H0: r = 0.5; Ha: r < 0.5

Likelihood ratio test statistic:

$$\lambda = 2 \ln \frac{L_a(\hat{\mu}_1, \hat{\mu}_{12}, \hat{\sigma}^2, r < 0.5)}{L_0(\hat{\mu}_1, \hat{\mu}_{12}, \hat{\sigma}^2, r = 0.5)} \tag{2-4}$$

The estimates of $\mu_1$, $\mu_{12}$, $\sigma^2$ differ depending on whether r is estimated or set to 0.5. In practice, a set of different values of r is tried, and the LR score shows how much more likely the data are if a QTL is present than if no QTL is present. The peak of the LR score is then compared with the threshold value, which is derived according to the significance level.

4. Simple Regression Method

The simple regression model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where i = 1, 2, …, n is the individual index, $\beta_0$ is the overall mean, and $\beta_1$ is the additive effect of the QTL when an allele substitution is made from the recurrent parent to the non-recurrent parent. $X_i$ is the indicator variable, which takes the value ½ for an individual with marker genotype MM and –½ for an individual with marker genotype Mm (Table 2-4).

Table 2-4. Possible outcomes for one marker one QTL situation.

Genotype    Frequency¹     X value   Y value
MQ / MQ     (1 – r) / 2    ½         µ1²
MQ / mQ     r / 2          –½        µ1
MQ / Mq     r / 2          ½         µ12³
MQ / mq     (1 – r) / 2    –½        µ12

¹ r is the recombination frequency between the marker and the QTL. ² µ1 is the mean value of the P1 genotype. ³ µ12 is the mean value of the F1 genotype.

From Table 2-4, we have:

E(X) = [(1–r)/2](1/2) + (r/2)(–1/2) + (r/2)(1/2) + [(1–r)/2](–1/2) = 0

E(X²) = [(1–r)/2](1/2)² + (r/2)(–1/2)² + (r/2)(1/2)² + [(1–r)/2](–1/2)² = ¼

$$\sigma_X^2 = E(X^2) - [E(X)]^2 = \tfrac{1}{4}$$

Because E(X) = 0, the covariance is

$$\sigma_{XY} = E(XY) = \frac{1-r}{2}\cdot\frac{1}{2}\mu_1 + \frac{r}{2}\cdot\left(-\frac{1}{2}\right)\mu_1 + \frac{r}{2}\cdot\frac{1}{2}\mu_{12} + \frac{1-r}{2}\cdot\left(-\frac{1}{2}\right)\mu_{12} = \tfrac{1}{4}(1-2r)(\mu_1 - \mu_{12}) = \tfrac{1}{4}(1-2r)(a-d)$$

$$\beta_1 = \frac{\sigma_{XY}}{\sigma_X^2} = (1 - 2r_{MQ})(a - d)$$

where $r_{MQ} = r$ is the recombination frequency between marker M and QTL Q. Therefore, testing whether the slope of the regression model is zero is equivalent to the t-test introduced above.
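A small numerical check can make the equivalence concrete: with X coded as +½ and −½, the least-squares slope equals the difference between the two marker-class means. The sketch below uses hypothetical toy data, and `regression_slope` is an illustrative helper, not a function from any QTL package.

```python
def regression_slope(x, y):
    """Least-squares slope: sum of cross-products over sum of squares of x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical toy data: X = +1/2 for MM individuals, -1/2 for Mm individuals
x = [0.5, 0.5, 0.5, -0.5, -0.5, -0.5]
y = [10.0, 12.0, 11.0, 8.0, 9.0, 7.0]

slope = regression_slope(x, y)
mean_difference = sum(y[:3]) / 3 - sum(y[3:]) / 3
print(slope, mean_difference)
```

Because the two X values differ by exactly one unit, the slope and the mean difference coincide for any class sizes, which is why both tests address the same quantity (1 − 2r)(a − d).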

2-2 Interval Mapping Method

When only one marker is used in QTL mapping, the QTL effects are underestimated and the QTL position cannot be determined. To overcome these drawbacks, Lander and Botstein (1989) introduced interval mapping as a systematic way to scan the whole genome for evidence of QTL. Interval mapping extends single-marker analysis by using two flanking markers to construct an interval within which a putative QTL is searched. The concept of using complete marker linkage maps for genomic scanning of QTL is important, and the idea of viewing QTL genotypes as missing data and using a mixture model for maximum likelihood analysis is influential.

The basic idea of interval mapping is simple. We first consider an interval between two observable markers M and N, each having two possible alleles in a backcross population. The genetic distance or recombination frequency between the two markers has been previously estimated. A map function (either Haldane or Kosambi) is used to translate recombination frequency into distance or vice versa. A LOD score is calculated at each increment (walking step) in the interval, and the LOD score profile is finally obtained for the whole genome. When a peak exceeds the threshold value, we declare that a QTL has been found at that location.
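The two map functions mentioned above have standard closed forms, sketched here in Python with distances in centimorgans. The function names are illustrative choices.

```python
import math

def haldane_r(d_cm):
    """Recombination frequency from map distance (cM), Haldane (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def haldane_d(r):
    """Map distance (cM) from recombination frequency, Haldane."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_r(d_cm):
    """Recombination frequency from map distance (cM), Kosambi."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)

def kosambi_d(r):
    """Map distance (cM) from recombination frequency, Kosambi."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# A 10 cM step translated to recombination frequencies under both functions
print(haldane_r(10.0), kosambi_r(10.0))
```

For short distances the two functions give nearly the same recombination frequency, and both approach 0.5 as the distance grows.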

1. Conditional Probabilities of QTL Genotypes

The basic element upon which the formal theory of QTL mapping is built is the

probability of the QTL genotype conditional on the observed marker genotypes. From

the definition of a conditional probability, we have

$$\Pr(Q \mid MN) = \frac{\Pr(QMN)}{\Pr(MN)} \tag{2-5}$$

The joint and marginal probabilities, Pr(QMN) and Pr(MN), are functions of the

experimental design and the linkage map. When computing joint probabilities

involving more than two loci, one must also account for recombination interference

between loci. When considering a single QTL flanked by two markers M and N, the

gamete frequencies depend on three parameters: the recombination frequency r12

between markers, the recombination frequency r1 between marker M and the QTL,

and the recombination frequency r2 between the QTL and marker N.


Table 2-5. The probability of the QTL genotype conditional on marker classes in B1 population.

Marker class   Prob1¹         Genotype     Prob2²                 Conditional³ (Prob2 / Prob1)
MN / MN        (1 – r12)/2    MQN / MQN    (1−r1)(1−r2)/2         Pr(QQ) = (1−r1)(1−r2) / (1 – r12) ≈ 1
                              MQN / MqN    r1 r2 / 2              Pr(Qq) = r1 r2 / (1 – r12) ≈ 0
MN / Mn        r12 / 2        MQN / MQn    (1−r1) r2 / 2          Pr(QQ) = (1−r1) r2 / r12 ≈ 1 − p
                              MQN / Mqn    r1(1−r2)/2             Pr(Qq) = r1(1−r2) / r12 ≈ r1 / r12 = p
MN / mN        r12 / 2        MQN / mQN    r1(1−r2)/2             Pr(QQ) = r1(1−r2) / r12 ≈ r1 / r12 = p
                              MQN / mqN    (1−r1) r2 / 2          Pr(Qq) = (1−r1) r2 / r12 ≈ 1 − p
MN / mn        (1 – r12)/2    MQN / mQn    r1 r2 / 2              Pr(QQ) = r1 r2 / (1 – r12) ≈ 0
                              MQN / mqn    (1−r1)(1−r2)/2         Pr(Qq) = (1−r1)(1−r2) / (1 – r12) ≈ 1

¹ Probability of the marker class. ² Probability of the marker-QTL genotype. ³ Conditional probability of the QTL genotype according to formula (2-5); here p equals r1 / r12.

Under the assumption of no interference (Haldane), the relationship between r12 and r1, r2 is $r_{12} = r_1 + r_2 - 2r_1 r_2$, while $r_{12} = r_1 + r_2$ under complete interference. When r12 is small, gamete frequencies are essentially identical under either interference assumption. Because the QTL genotype is unobserved, we can only use the observable marker genotypes to infer it. Table 2-5 shows the probability of the QTL genotype given the genotypes of the two flanking markers.
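The exact conditional probabilities of Table 2-5 are easy to compute directly. The sketch below assumes no interference (the Haldane relationship for r12); the function name and the example values of r1 and r2 are hypothetical.

```python
def qq_probabilities(r1, r2):
    """Pr(QQ | flanking marker class) in a B1 population, assuming no interference.

    r1: recombination frequency between marker M and the QTL
    r2: recombination frequency between the QTL and marker N
    """
    r12 = r1 + r2 - 2.0 * r1 * r2  # Haldane relationship between r12, r1, r2
    return {
        "MN/MN": (1 - r1) * (1 - r2) / (1 - r12),
        "MN/Mn": (1 - r1) * r2 / r12,
        "MN/mN": r1 * (1 - r2) / r12,
        "MN/mn": r1 * r2 / (1 - r12),
    }

probs = qq_probabilities(0.05, 0.15)
for marker_class, p_qq in probs.items():
    # Pr(Qq | marker class) is the complement of Pr(QQ | marker class)
    print(marker_class, round(p_qq, 4), round(1 - p_qq, 4))
```

Working with the exact expressions rather than the approximation p = r1 / r12 avoids ignoring double recombinants, at no extra cost.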

2. Genetic Model

For a backcross population, to analyse a QTL located in an interval flanked by markers M and N, the interval mapping method assumes the following linear model:

$$y_j = \mu + b^* x_j^* + e_j, \qquad j = 1, 2, \ldots, n \tag{2-6}$$

where $b^*$ is the effect of the putative QTL,

$$x_j^* = \begin{cases} 1 & \text{if the QTL genotype is } QQ \\ 0 & \text{if the QTL genotype is } Qq \end{cases}$$

and $e_j \sim N(0, \sigma^2)$.

In the model, the variable $x^*$ indicates the QTL genotype, which is unobserved. However, the probabilities of the possible QTL genotypes can be inferred from the genotypes of the two flanking markers, as shown in Table 2-5 and summarized in Table 2-6. For a backcross population, we can define

$$p_{jk} = \mathrm{Prob}(x_j^* = k \mid M, N, p), \qquad k = 0, 1$$

where $p = r_1 / r_{12}$, and the approximations in Table 2-6 are obtained by assuming that double recombination events can be ignored.

Table 2-6. The probabilities of possible QTL genotypes conditional on marker classes.

Marker class   Number   QTL genotype QQ (1)                  QTL genotype Qq (0)
MN / MN        n1       (1−r1)(1−r2) / (1−r12) ≈ 1           r1 r2 / (1−r12) ≈ 0
MN / Mn        n2       (1−r1) r2 / r12 ≈ 1 − p              r1 (1−r2) / r12 ≈ p
MN / mN        n3       r1 (1−r2) / r12 ≈ p                  (1−r1) r2 / r12 ≈ 1 − p
MN / mn        n4       r1 r2 / (1−r12) ≈ 0                  (1−r1)(1−r2) / (1−r12) ≈ 1

3. Maximum Likelihood Analysis

For model (2-6), there are two possible QTL genotypes, each of which can be true with a certain probability. The distribution of the model is therefore a mixture of normal distributions, and the likelihood function can be defined as

$$L(\mu, b^*, \sigma^2) = \prod_{j=1}^{n}\left[p_{j1}\,\phi\!\left(\frac{y_j - \mu - b^*}{\sigma}\right) + p_{j0}\,\phi\!\left(\frac{y_j - \mu}{\sigma}\right)\right] \tag{2-7}$$

where $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ is the standard normal density function.

In likelihood function (2-7), the parameters include:

$\mu$ - the mean of the model

$b^*$ - the effect of the putative QTL

$p = r_1 / r_{12}$ - the position of the putative QTL relative to the flanking markers

$\sigma^2$ - the residual variance of the model

The data of the analysis include:

$y_j$ - the phenotypic value of a quantitative trait for each individual

Genotypes of markers for each individual, which contribute to the analysis through $p_{jk}$, k = 0, 1; j = 1, 2, …, n

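The maximum likelihood analysis of a mixture model is usually carried out with an Expectation-Maximization (EM) algorithm. EM is an iterative procedure, and the E-step for likelihood function (2-7) is to calculate:

$$P_j = \frac{p_{j1}\,\phi\!\left(\dfrac{y_j - \mu - b^*}{\sigma}\right)}{p_{j1}\,\phi\!\left(\dfrac{y_j - \mu - b^*}{\sigma}\right) + p_{j0}\,\phi\!\left(\dfrac{y_j - \mu}{\sigma}\right)}$$

The M-step is to calculate:

$$\hat{\mu} = \sum_{j=1}^{n} (y_j - P_j b^*) \Big/ n$$

$$\hat{b}^* = \sum_{j=1}^{n} (y_j - \mu) P_j \Big/ \sum_{j=1}^{n} P_j$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}\left[(y_j - \mu)^2 - P_j b^{*2}\right]$$

This process is iterated until the estimates converge.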
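The E- and M-steps above can be sketched as a short Python routine. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `em_interval_mapping`, the starting values (overall mean, zero effect, total variance) and the convergence rule are all choices made here, and no safeguards against numerical underflow are included.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def em_interval_mapping(y, p1, max_iter=500, tol=1e-10):
    """EM estimates of (mu, b, sigma2) for the mixture likelihood (2-7).

    y  : phenotypic values
    p1 : prior probabilities Pr(x* = 1 | flanking markers), one per individual
    """
    n = len(y)
    mu = sum(y) / n                                # start from the overall mean
    b = 0.0                                        # start from no QTL effect
    sigma2 = sum((v - mu) ** 2 for v in y) / n     # start from the total variance
    for _ in range(max_iter):
        sigma = math.sqrt(sigma2)
        # E-step: posterior probability that individual j has QTL genotype QQ
        post = []
        for yj, pj in zip(y, p1):
            f1 = pj * phi((yj - mu - b) / sigma)
            f0 = (1.0 - pj) * phi((yj - mu) / sigma)
            post.append(f1 / (f1 + f0))
        # M-step: update mu, then b, then sigma2
        mu_new = sum(yj - Pj * b for yj, Pj in zip(y, post)) / n
        b_new = sum(Pj * (yj - mu_new) for yj, Pj in zip(y, post)) / sum(post)
        sigma2 = sum((yj - mu_new) ** 2 - Pj * b_new ** 2
                     for yj, Pj in zip(y, post)) / n
        converged = abs(b_new - b) < tol and abs(mu_new - mu) < tol
        mu, b = mu_new, b_new
        if converged:
            break
    return mu, b, sigma2

# Toy case: the QTL sits on a marker, so the prior probabilities are 0 or 1
y = [10.0, 12.0, 6.0, 8.0]
p1 = [1.0, 1.0, 0.0, 0.0]
mu, b, sigma2 = em_interval_mapping(y, p1)
print(mu, b, sigma2)
```

In this degenerate toy case the estimates converge to the Qq class mean, the difference between the class means, and the pooled within-class variance, which gives a simple sanity check on the update equations.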

4. Likelihood Ratio Test

The test statistic can be constructed from a likelihood ratio expressed as a LOD (logarithm of odds) score:

$$LOD = -\log_{10}\frac{L(\hat{\mu}, b^* = 0, \hat{\sigma}^2)}{L(\hat{\mu}, \hat{b}^*, \hat{\sigma}^2)}$$

under the hypotheses

$$H_0: b^* = 0 \quad \text{and} \quad H_1: b^* \neq 0.$$

By assuming that the putative QTL is located at the position indicated by $p = r_1 / r_{12}$, we can obtain the maximum likelihood estimates of $\mu$, $b^*$, $\sigma^2$ under H1 (the denominator) and the estimates of $\mu$, $\sigma^2$ under H0 (the numerator) with $b^*$ constrained to zero. The LOD score test is essentially the same test as the usual likelihood ratio test:

$$LR = -2\ln\frac{L(\hat{\mu}, b^* = 0, \hat{\sigma}^2)}{L(\hat{\mu}, \hat{b}^*, \hat{\sigma}^2)}$$

The relationship between the LOD value and the LR value is

$$LOD = \tfrac{1}{2}(\log_{10} e)\, LR = 0.217\, LR$$

The test can be performed at any position covered by markers, so the method provides a systematic strategy for searching for QTL. The amount of support for a QTL at a particular map position is often displayed graphically as a likelihood profile, which plots the likelihood ratio test statistic as a function of the map position of the putative QTL. If the LOD score in a region exceeds a pre-defined critical threshold, a QTL is indicated in the neighbourhood of the maximum of the LOD score, with the width of the neighbourhood defined by a one- or two-LOD support interval (Lander and Botstein 1989). By the property of maximum likelihood analysis, the estimates of locations and effects of QTL are asymptotically unbiased if the assumption that there is at most one QTL on a chromosome is true.

The test statistic LR for a given position is expected to be asymptotically chi-square distributed with one degree of freedom under the null hypothesis for the backcross design and with two degrees of freedom for the F2 design (Lander and Botstein 1989, Van Ooijen 1992, Zeng 1994). However, because the test is usually performed over the whole genome, there is a multiple testing problem. The distribution of the maximum LR or LOD score over the whole genome under the null hypothesis becomes very complicated. An asymptotic theory based on an Ornstein-Uhlenbeck diffusion process for determining appropriate genome-wise critical values has been developed by Lander and Botstein (1989), Feingold et al. (1993) and Lander and Schork (1994). Lander and Botstein (1989) suggested that a typical LOD score threshold should be between 2 and 3 to ensure a 5% overall false positive error for detecting QTL.
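The conversion between LOD and LR, and the pointwise chi-square tail probability for the backcross design, can be sketched as follows. The function names are illustrative. The p-value shown is for a single test position with one degree of freedom only; it is not a genome-wise significance level.

```python
import math

def lr_to_lod(lr):
    """LOD = LR / (2 ln 10), i.e. about 0.217 * LR."""
    return lr / (2.0 * math.log(10.0))

def lod_to_lr(lod):
    """Inverse conversion: LR = 2 ln(10) * LOD."""
    return 2.0 * math.log(10.0) * lod

def pointwise_p_value(lr):
    """Upper tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(lr / 2.0))

print(lod_to_lr(3.0))              # LR value corresponding to a LOD threshold of 3
print(pointwise_p_value(3.841))    # pointwise tail probability at LR = 3.841
```

The tail formula uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.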

2-3 Composite Interval Mapping

For the interval mapping method, the estimated locations and effects of QTL tend to be asymptotically unbiased if there is only one segregating QTL on a chromosome. However, if there is more than one QTL on a chromosome, the test statistic at the position being tested will be affected by all those QTL, and the estimated positions and effects of QTL identified by this method are likely to be biased (the 'ghost QTL' problem). One reason for these shortcomings is that the test used in interval mapping is not an interval test. In an interval test, the effect of the QTL within a defined interval is independent of the effects of QTL outside the region. Without this property, even when there is no QTL within an interval, the likelihood profile on the interval can still exceed the threshold significantly if there is a QTL in some nearby region on the same chromosome.

To overcome this shortcoming of the interval mapping method, Zeng (1994) proposed an improved method called composite interval mapping, which combines interval mapping with multiple regression analysis. Let us first review some relevant theory of multiple regression analysis for QTL mapping (Zeng 1993).

1. Properties of Multiple Regression Analysis

Because genes are arranged linearly on chromosomes, multiple regression analysis has a very important property: the partial regression coefficient of a trait on a marker is expected to depend only on those QTLs that are located in the intervals bracketed by the two neighbouring markers. It is independent of any other QTL outside the region if there is no crossing-over interference and no epistasis. Interference and epistasis, however, introduce non-linearity into the model.

Suppose we regress the trait value y on t markers observed in a B1 population:

$$y_j = \mu + \sum_{k=1}^{t} b_k x_{jk} + e_j$$

where $x_{jk}$ is the indicator value (1 or 0) of the kth marker in the jth individual, and $b_k$ is the partial regression coefficient of the phenotype y on the kth marker conditional on all other markers. $b_k$ can also be denoted as $b_{yk.s_k}$, where $s_k$ denotes the set that includes all markers except the kth marker.

Since $x_{jk}$ takes a value of 1 or 0 with equal probability, the variance of the kth marker in the population is $\sigma_k^2 = 1/4$. It is easy to show that the covariance between the ith and kth markers is $\sigma_{ik} = (1 - 2r_{ik})/4$, where $r_{ik}$ is the recombination frequency between marker i and marker k. The covariance between the trait value y and the kth marker is:

$$\sigma_{yk} = \sum_{u=1}^{m} (1 - 2r_{uk})\,\delta_u / 4$$

where $\delta_u$ is the effect of the uth QTL.

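With these basic equations, any conditional variance and covariance can be derived. The variance of marker k conditional on marker i is:

$$\sigma_{k.i}^2 = \sigma_k^2 - \sigma_{ik}^2 / \sigma_i^2 = \left[1 - (1 - 2r_{ik})^2\right] / 4 = r_{ik}(1 - r_{ik})$$

Without interference, we have:

$$(1 - 2r_{ik}) = (1 - 2r_{il})(1 - 2r_{lk}) \quad \text{for order } i\,l\,k \text{ or } k\,l\,i$$

The covariance between markers i and k conditional on marker l is:

$$\sigma_{ik.l} = \sigma_{ik} - \sigma_{il}\sigma_{lk} / \sigma_l^2 = \left[(1 - 2r_{ik}) - (1 - 2r_{il})(1 - 2r_{lk})\right] / 4 = \begin{cases} 0 & \text{for order } i\,l\,k \text{ or } k\,l\,i \\ r_{kl}(1 - r_{kl})(1 - 2r_{ik}) & \text{for order } l\,k\,i \text{ or } i\,k\,l \\ r_{il}(1 - r_{il})(1 - 2r_{ik}) & \text{for order } l\,i\,k \text{ or } k\,i\,l \end{cases}$$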
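The conditional covariance result can be checked numerically. The sketch below assumes the Haldane map function (no interference) and hypothetical marker positions in cM; the helper names are illustrative.

```python
import math

def haldane_r(d_cm):
    """Recombination frequency from map distance (cM) under no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def marker_cov(pos_a, pos_b):
    """Covariance (1 - 2r)/4 between two 0/1 marker indicators in a B1 population."""
    return (1.0 - 2.0 * haldane_r(abs(pos_a - pos_b))) / 4.0

def conditional_cov(pos_i, pos_k, pos_l):
    """Covariance between markers i and k conditional on marker l."""
    marker_var = 0.25  # variance of a 0/1 indicator with equal probabilities
    return (marker_cov(pos_i, pos_k)
            - marker_cov(pos_i, pos_l) * marker_cov(pos_l, pos_k) / marker_var)

# Hypothetical positions in cM
between = conditional_cov(0.0, 30.0, 10.0)   # marker l lies between i and k
outside = conditional_cov(0.0, 30.0, 40.0)   # order i, k, l
print(between, outside)
```

When the conditioning marker lies between the other two, the conditional covariance vanishes up to rounding error; when it lies outside, the value agrees with the closed form r_kl(1 − r_kl)(1 − 2r_ik).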

The above result shows that, conditional on an intermediate marker, the covariance between two flanking markers is expected to be zero. From this property Zeng (1993) shows:

$$b_{yk.s_k} = \sum_{k-1 < u \le k} \frac{r_{(k-1)u}\left(1 - r_{(k-1)u}\right)\left(1 - 2r_{uk}\right)}{r_{(k-1)k}\left(1 - r_{(k-1)k}\right)}\, a_u + \sum_{k < u \le k+1} \frac{r_{u(k+1)}\left(1 - r_{u(k+1)}\right)\left(1 - 2r_{ku}\right)}{r_{k(k+1)}\left(1 - r_{k(k+1)}\right)}\, a_u$$

where the first summation is over all QTLs located between markers k-1 and k, and the second summation is over all QTLs located between markers k and k+1. This is a very desirable property: the regression coefficient depends only on those QTLs that are located between markers k-1 and k+1. This property can be used to create an interval test, in which we test whether there are QTLs within a marker interval.

There are also other properties of multiple regression that have direct relevance to QTL mapping. These are summarized as follows:

- Conditioning on unlinked markers in the multiple regression analysis will reduce the sampling variance of the test statistic by controlling some residual genetic variation, and thus will increase the power of QTL mapping.

- Conditioning on linked markers in the multiple regression analysis will reduce the chance of interference of possible multiple linked QTL on hypothesis testing and parameter estimation, but with a possible increase of sampling variance.

- Two sample partial regression coefficients of the trait value on two markers in a multiple regression analysis are generally uncorrelated unless the two markers are adjacent.

2. Genetic Model

Composite interval mapping is an extension of interval mapping in which some selected markers are also fitted in the model as cofactors to control the genetic variation of other possibly linked or unlinked QTL. To test for a QTL in an interval between adjacent markers Mi and Mi+1, the model is:

$$y_j = \mu + b^* x_j^* + \sum_k b_k x_{jk} + e_j \tag{2-8}$$

where $x_j^*$ refers to the putative QTL and $x_{jk}$ refers to the markers selected for genetic background control. Appropriate selection of markers as cofactors is important and will be discussed later.

3. Likelihood Analysis

The likelihood function of formula (2-8) is specified as:

$$L(b^*, B, \sigma^2) = \prod_{j=1}^{n}\left[p_{j1}\,\phi\!\left(\frac{y_j - X_j B - b^*}{\sigma}\right) + p_{j0}\,\phi\!\left(\frac{y_j - X_j B}{\sigma}\right)\right]$$

where $X_j B = \mu + \sum_k b_k x_{jk}$, and the maximum likelihood estimates of the various parameters are given below (using the EM algorithm):

$$P_j = \frac{p_{j1}\,\phi\!\left[(y_j - X_j B - b^*)/\sigma\right]}{p_{j1}\,\phi\!\left[(y_j - X_j B - b^*)/\sigma\right] + p_{j0}\,\phi\!\left[(y_j - X_j B)/\sigma\right]}$$

$$\hat{b}^* = \sum_{j=1}^{n} (y_j - X_j \hat{B}) P_j \Big/ \sum_{j=1}^{n} P_j = (Y - X\hat{B})' P / c$$

where $P = \{P_j\}_{n \times 1}$, $Y = \{y_j\}_{n \times 1}$, $c = \sum_{j=1}^{n} P_j$, and the prime denotes matrix transposition.

$$\hat{B} = (X'X)^{-1} X' (Y - P\hat{b}^*)$$

$$\hat{\sigma}^2 = \left[(Y - X\hat{B})'(Y - X\hat{B}) - \hat{b}^{*2} c\right] / n$$

4. Hypothesis Test

The hypotheses to be tested are H0: b* = 0 and H1: b* ≠ 0. The likelihood function under the null hypothesis is:

$$L(b^* = 0, B, \sigma^2) = \prod_{j=1}^{n} \phi\!\left(\frac{y_j - X_j B}{\sigma}\right)$$

The maximum likelihood estimates of B and $\sigma^2$ under the null hypothesis are:

$$\hat{B} = (X'X)^{-1} X'Y \quad \text{and} \quad \hat{\sigma}^2 = (Y - X\hat{B})'(Y - X\hat{B}) / n$$

The likelihood ratio (LR) test statistic is:

$$LR = -2\ln\frac{L(b^* = 0, \hat{B}, \hat{\sigma}^2)}{L(\hat{b}^*, \hat{B}, \hat{\sigma}^2)}$$

As in the interval mapping method, the test can be performed at any position in a genome covered by markers, so it is easy to perform a systematic search for QTLs in a genome. Because the test statistic is almost independent for each interval, a test on each interval is more likely to test for a single QTL only.

5. Marker Selection

The main difficulty in using the composite interval mapping method is deciding which markers should be added to the model before searching for QTL. There is no simple solution, because the answer depends on the number and positions of the underlying QTLs, and this information is not available before QTL mapping. Selecting too few markers may not achieve the purpose of removing most of the residual genetic variation, and selecting too many markers may reduce the power of the analysis.

The practical implementation of marker selection in the QTL Cartographer software has two steps. In the first step, a selection procedure such as forward, backward, or stepwise regression selects $n_p$ markers that are significantly associated with the trait. In the second step, a testing window is defined so that markers inside the window are blocked from being used in the test model. The window is constructed using a parameter $W_s$, the distance (cM) between the testing interval (one for each direction) and the nearest marker picked for the model. Then, those of the selected $n_p$ markers that lie outside the testing window are also fitted into the model to reduce the residual variance.

Different conditions of composite interval mapping can be created by changing the values of $n_p$ and $W_s$. Generally $n_p$ should be much smaller than n, not exceeding $2\sqrt{n}$ (Jansen 1994), or alternatively it can be determined automatically by the F-to-enter or F-to-drop criterion in the forward or backward regression analysis. $W_s$ should be at least 10 or 15 cM, depending on sample size.
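The second step, blocking pre-selected markers that fall inside the testing window, can be sketched as follows. This is a simplified illustration for a single chromosome and is not the QTL Cartographer code: the function name, the dictionary layout, the marker names and the handling of markers exactly on the window boundary are all assumptions made here.

```python
def cofactors_outside_window(selected, left_pos, right_pos, window_cm):
    """Return the pre-selected markers that lie outside the testing window.

    selected   : dict mapping marker name -> position (cM) on the tested chromosome
    left_pos   : position of the left flanking marker of the tested interval
    right_pos  : position of the right flanking marker of the tested interval
    window_cm  : window size Ws, applied on each side of the tested interval
    """
    lower = left_pos - window_cm
    upper = right_pos + window_cm
    # Markers on the boundary are treated as inside the window (an assumption)
    return sorted(name for name, pos in selected.items()
                  if pos < lower or pos > upper)

# Hypothetical markers picked by stepwise regression, with positions in cM
selected = {"m1": 2.0, "m3": 25.0, "m4": 38.0, "m7": 80.0}
cofactors = cofactors_outside_window(selected, 30.0, 40.0, 10.0)
print(cofactors)
```

Selected markers on other chromosomes are unlinked to the tested interval and would simply all be kept as cofactors.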

2-4 Mixed Linear Model Approach

As introduced above, the CIM method is based on fixed-effect multiple regression models. Zhu (1998) suggested a new methodology for mapping QTL using a mixed linear model approach, called mixed-model based composite interval mapping (MCIM). Unlike the CIM method, the MCIM method treats the marker effects as random effects. The obvious advantage is that the model can be extended easily to more complicated QTL mapping situations, such as QTL by environment interaction and QTL epistasis.

1. Genetic Model

For a B1 or DH population, to analyze a QTL located in an interval flanked by markers $M_{i-}$ and $M_{i+}$, the MCIM method assumes the following model:

$$y_j = \mu + a\,x_{A(j)} + \sum_k u_{M(kj)}\, e_{M(k)} + \varepsilon_j$$

where $y_j$ is the trait value for individual j, $\mu$ is the population mean, $a$ is the additive effect of the putative QTL and $x_{A(j)}$ is the coefficient for the additive effect, $e_{M(k)}$ is the random effect of marker k with its coefficient $u_{M(kj)}$, and $\varepsilon_j$ is the random residual effect.

The model can also be expressed in mixed linear model form as follows:

$$y = Xb + U_M e_M + e_\varepsilon = Xb + \sum_{u=1}^{2} U_u e_u \sim N(Xb, V), \qquad V = \sigma_M^2 U_M R_M U_M' + \sigma_e^2 I = \sum_{u=1}^{2} \sigma_u^2 U_u R_u U_u' \tag{2-9}$$

where V is the variance of the model, $\sigma_1^2 = \sigma_M^2$ is the variance component of markers and $\sigma_2^2 = \sigma_e^2$ is the residual variance component. $R_1 = R_M$ is a known symmetric matrix of correlation coefficients and $R_2 = I$ is the identity matrix.

$$R_M = [\rho_{ff'}] \qquad (f, f' = 1, 2, \ldots, m)$$

In the above formula, m is the number of markers selected for background control and $\rho_{ff'} = 1 - 2r_{ff'}$ is the correlation coefficient between marker effects $e_{M(f)}$ and $e_{M(f')}$. $r_{ff'}$ is the recombination frequency between marker loci f and f'.
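The marker correlation matrix R_M can be built directly from marker positions. The sketch below assumes that all markers lie on one chromosome and that recombination frequencies follow the Haldane map function; both are assumptions of this illustration, as is the function name.

```python
import math

def marker_correlation_matrix(positions_cm):
    """R_M = [rho_ff'] with rho_ff' = 1 - 2 r_ff' for markers on one chromosome."""
    m = len(positions_cm)
    matrix = []
    for f in range(m):
        row = []
        for g in range(m):
            d = abs(positions_cm[f] - positions_cm[g]) / 100.0  # distance in Morgans
            r = 0.5 * (1.0 - math.exp(-2.0 * d))                # Haldane map function
            row.append(1.0 - 2.0 * r)
        matrix.append(row)
    return matrix

# Three hypothetical background markers at 0, 10 and 30 cM
R = marker_correlation_matrix([0.0, 10.0, 30.0])
for row in R:
    print([round(v, 4) for v in row])
```

Markers on different chromosomes have r = 0.5, so their correlation coefficient is zero and the full matrix is block diagonal by chromosome.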

2. Likelihood Analysis

The log-likelihood function for formula (2-9) is specified as:

$$l(b, V) = \log L(b, V) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|V| - \frac{1}{2}(y - Xb)' V^{-1} (y - Xb) \tag{2-10}$$

where the variance V of the model can be calculated according to formula (2-9) and the variance components $\sigma_u^2$ can be estimated by MINQUE-1 (Rao 1971; Rao 1997) or the REML method (Hartley and Rao 1967; Searle 1970).

The estimates of the QTL effects b are obtained by the formula:

$$\hat{b} = (X'V^{-1}X)^{-1} X'V^{-1} y \tag{2-11}$$

3. Hypothesis Test

As in the IM and CIM methods, a putative QTL is searched for between two flanking markers $M_{i-}$ and $M_{i+}$ over the whole genome by setting a prior value $\hat{r}_{M_{i-}Q}$ for the recombination frequency between marker $M_{i-}$ and the putative QTL locus Q. The likelihood ratio statistic (LR) can be calculated by:

$$LR = 2\,l_1(\hat{b}, \hat{V}, \hat{r}_{M_{i-}Q}) - 2\,l_0(\hat{b}, \hat{V}, r_{M_{i-}Q} = 0.5) \tag{2-12}$$

Therefore, the LR profile for the whole genome can be plotted and the QTLs can be located according to the LR profile.

4. A Model for GE Interaction

If QTL mapping experiments are conducted in several environments for individuals sampled from the same DH population, QTL genetic main effects and GE interaction effects can be evaluated by the MCIM method. When experimental data obtained from different environments are analyzed, environment effects are usually treated as random effects. The additive model (2-9) can be extended to include environment and replication effects together with the interactions of the additive and marker effects with environments.

The trait value measured on the jth individual in the hth environment and kth replication can be expressed as:

$$y_{hjk} = \mu + a\,x_{A(j)} + u_{E(hj)}\,e_{E(h)} + u_{AE(hj)}\,e_{AE(h)} + \sum_f u_{M(fj)}\,e_{M(f)} + \sum_l u_{ME(lhj)}\,e_{ME(lh)} + u_{B(k)}\,e_{B(k)} + \varepsilon_{hjk}$$

where $\mu$ is the population mean, $a$ is the additive main effect of the QTL being searched and $x_{A(j)}$ is the coefficient for the additive main effect, $e_{E(h)}$ is the environment effect with its coefficient $u_{E(hj)}$, $e_{AE(h)}$ is the additive by environment interaction effect with its coefficient $u_{AE(hj)}$, $e_{M(f)}$ is the random main effect over environments for the fth marker genotype with its coefficient $u_{M(fj)}$, $e_{ME(lh)}$ is the marker by environment interaction effect with its coefficient $u_{ME(lhj)}$, $e_{B(k)}$ is the replication effect with its coefficient $u_{B(k)}$, and $\varepsilon_{hjk}$ is the random residual effect.

The model can also be expressed in mixed linear model form:

$$y = Xb + U_E e_E + U_{AE} e_{AE} + U_M e_M + U_{ME} e_{ME} + U_B e_B + e_\varepsilon = Xb + \sum_{u=1}^{6} U_u e_u \sim N(Xb, V)$$

$$V = \sigma_E^2 U_E U_E' + \sigma_{AE}^2 U_{AE} U_{AE}' + \sigma_M^2 U_M R_M U_M' + \sigma_{ME}^2 U_{ME} R_{ME} U_{ME}' + \sigma_B^2 U_B U_B' + \sigma_e^2 I = \sum_{u=1}^{6} \sigma_u^2 U_u R_u U_u' \tag{2-13}$$

By using model (2-13) and formula (2-10), QTLs can be searched for (according to the LR values) by mixed linear model approaches using the data for all individuals across multiple environments and replications. When a QTL is found, its position on the chromosome and its genetic main effects (formula 2-11) as well as GE interaction effects can be obtained by the mixed linear model approaches. GE interaction effects can be predicted by the BLUP method (Zhu 1999):

$$\hat{e}_u = \hat{\sigma}_u^2 R_u U_u' Q y, \qquad Q = V^{-1} - V^{-1}X(X'V^{-1}X)^{+}X'V^{-1}$$

5. A Model for QTL Epistasis

The following model can be used for a two-way search for QTLs with digenic epistatic effects when the population is B1 or DH:

$$y_k = \mu + a_i x_{ik} + a_j x_{jk} + aa_{ij} w_{ijk} + \sum_{f=1}^{m} u_{M(fk)}\,e_{M(f)} + \sum_{h=1}^{mm} u_{MM(hk)}\,e_{MM(h)} + \varepsilon_k$$

where $y_k$ is the trait value of individual k and $\mu$ is the mean of the population. $a_i$ and $a_j$ are the additive effects of the two putative QTLs i and j at the two testing points $p_i$ and $p_j$. $aa_{ij}$ is the digenic additive by additive epistatic effect between QTLs i and j. $x_{ik}$, $x_{jk}$ and $w_{ijk}$ are the coefficients for the effects of QTL i, QTL j, and QTL epistasis, respectively. $e_{M(f)}$ is the random effect of marker f and $e_{MM(h)}$ is the random effect of the hth two-locus marker interaction. $u_{M(fk)}$ and $u_{MM(hk)}$ are their coefficients. $\varepsilon_k$ is the random residual effect.

The model can also be expressed in the following mixed linear model form:

$$y = Xb + U_M e_M + U_{MM} e_{MM} + e_\varepsilon = Xb + \sum_{u=1}^{3} U_u e_u \sim N(Xb, V), \qquad V = \sigma_M^2 U_M R_M U_M' + \sigma_{MM}^2 U_{MM} R_{MM} U_{MM}' + \sigma_e^2 I = \sum_{u=1}^{3} \sigma_u^2 U_u R_u U_u' \tag{2-14}$$

where $R_{MM} = [\rho_{hh'}]$ (h, h' = 1, 2, …, mm) is a known symmetric matrix of correlation coefficients for marker interactions. Each element $\rho_{hh'} = \rho_{ij\_i'j'}$ (i < j, i' < j') is determined by the recombination frequencies among the four marker loci involved, where the set (i, j, i', j') equals the set (a, b, c, d) with a < b < c < d on a whole-genome basis.

The basic idea of mapping QTLs with marginal and two-way epistatic effects is a two-dimensional search along the whole genome. For each pair of testing points $p_i$ and $p_j$, each within an interval flanked by two markers, the LR value can be calculated using formula (2-12), and the QTLs can be located by analysing the LR profile.

2-5 Multiple Interval Mapping

Multiple interval mapping (MIM) is a multiple-QTL oriented method that combines QTL mapping analysis with the analysis of the genetic architecture of quantitative traits, using a search algorithm to search for the number, positions, effects and interactions of significant QTL simultaneously. For m putative QTL, the multiple interval mapping model for a B1 population is defined by:

$$y_i = \mu + \sum_{r=1}^{m} \alpha_r x_{ir}^* + \sum_{r<s}^{t} \beta_{rs}\,(x_{ir}^* x_{is}^*) + e_i \tag{2-15}$$

where $y_i$ is the trait value of individual i and $\mu$ is the mean of the model. $\alpha_r$ is the additive (marginal) effect of putative QTL r and $x_{ir}^*$ is its coefficient, which is unobserved but can be inferred from marker data in the sense of probability. $\beta_{rs}$ is the epistatic effect between putative QTL r and s, t is the number of significant pairwise epistatic effects, and $e_i$ is the random residual effect.

The likelihood function of the data given model (2-15) is a mixture of normal distributions:

$$L(E, \mu, \sigma^2) = \prod_{i=1}^{n}\left[\sum_{j=1}^{2^m} p_{ij}\,\phi(y_i \mid \mu + D_j E, \sigma^2)\right]$$

where $p_{ij}$ is the probability of each multilocus genotype conditional on marker data, E is a vector of QTL parameters ($\alpha$'s and $\beta$'s), $D_j$ is a vector specifying the configuration of $x^*$'s associated with each $\alpha$ and $\beta$ for the jth QTL genotype, $\phi(y \mid \mu, \sigma^2)$ denotes a normal density function for y with mean $\mu$ and variance $\sigma^2$, and n is the number of individuals.

The MIM method consists of the following four components:

- An evaluation procedure designed to analyse the likelihood of the data given a genetic model (number, positions and epistasis of QTL) (Kao and Zeng 1997).

- A search strategy optimised to select the best (or a better) genetic model in the parameter space.

- An estimation procedure for all parameters of the genetic architecture of the quantitative traits simultaneously, given the selected genetic model.

- A prediction procedure to estimate or predict the genotypic values of individuals based on the selected genetic model and estimated genetic parameter values for marker-assisted selection.

Among these components, the second is the critical part of the MIM method. In the next chapter, simulation studies are conducted on the selection criteria used in the model selection framework.

3. Simulation Studies

3-1 Simulation Model and Data

In this section, the model and method for producing simulation data of QTL mapping experiments are discussed. The simulation data consist of two parts: mapping information and QTL information.

1. Genetic Model for Simulation

The following is a general genetic model for a B1 or DH population with m QTLs:

$$y_i = \mu + \sum_{r=1}^{m} \alpha_r x_{ir} + \sum_{r<s \subset (1 \ldots m)} \beta_{rs} (x_{ir} x_{is}) + e_i \qquad \text{(3-1)}$$

where $y_i$ is the trait value of individual $i$, and $i$ is the index of the individual in the population ($i = 1, 2, \ldots, n$). $\mu$ is the mean of the model. $\alpha_r$ is the marginal effect of QTL $r$, and $x_{ir}$ is an indicator variable denoting the genotype of QTL $r$. $x_{ir}$ takes the values ½ and -½ for the B1 population and 1 and -1 for the DH population. $\beta_{rs}$ is the epistatic effect between QTL $r$ and QTL $s$, and $m$ is the number of QTLs chosen for simulation. $e_i$ is the residual effect of the model, assumed to be normally distributed with mean zero and variance $\sigma^2 = V_e$.
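To make model (3-1) concrete, the following sketch computes the trait value of one individual from its coded QTL genotypes. The function name and the dictionary layout for the epistatic effects are assumptions made for this example.

```python
from itertools import combinations

def trait_value(x, mu, alpha, beta, e=0.0):
    """Trait value under model (3-1).

    x     : coded QTL genotypes (+-1/2 for B1, +-1 for DH), length m
    alpha : marginal effects, length m
    beta  : dict mapping a zero-based pair (r, s) with r < s to its epistatic effect;
            pairs that are absent have no epistasis
    e     : residual effect
    """
    y = mu + sum(a * x_r for a, x_r in zip(alpha, x))
    for r, s in combinations(range(len(x)), 2):
        y += beta.get((r, s), 0.0) * x[r] * x[s]
    return y + e

# DH individual with two QTLs (assumed values): 10 + 2*1 + 1*(-1) + 0.5*(1*-1)
print(trait_value([1, -1], mu=10.0, alpha=[2.0, 1.0], beta={(0, 1): 0.5}))  # 10.5
```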

The variance of model (3-1) can be partitioned into several components, such as additive variance, epistatic variance, and residual variance. Formulas (3-2A) and (3-3A) give the additive and epistatic variances for the B1 population, and formulas (3-2B) and (3-3B) give the additive and epistatic variances for the DH population.

$$\mathrm{Var}(y) = \mathrm{Var}(G) + \mathrm{Var}(E) = V_A + V_I + V_e, \quad \text{with} \quad \mathrm{Var}(G) = E(G^2) - [E(G)]^2 = V_A + V_I$$

$$V_A = \frac{1}{4}\sum_{i}\alpha_i^2 + \frac{1}{2}\sum_{i<j}\alpha_i\alpha_j(1-2r_{ij}) \qquad \text{(3-2A)}$$

$$V_A = \sum_{i}\alpha_i^2 + 2\sum_{i<j}\alpha_i\alpha_j(1-2r_{ij}) \qquad \text{(3-2B)}$$

−−−−= ∑ ∑

<< <lkji jiijijklijklijI rrrV

,

2

)21()21)(21(161 βββ (3-3A)

−−−−= ∑ ∑

<< <lkji jiijijklijklijI rrrV

,

2

)21()21)(21(41 βββ (3-3B)

where $r_{ij}$ is the recombination frequency between QTL $i$ and QTL $j$.

2. Parameter Setting

The first step in producing simulation data is to set the mapping parameters, such as the experimental population (B1 or DH), sample size (n), trait mean (µ), map function (Haldane or Kosambi), and marker genotype coding (for example, 1 for one genotype and 0 for the other). In particular, it is important to define the chromosome information, such as the number of chromosomes and the number and positions of the markers on each chromosome. Table 3-1 shows an example of parameter settings for the QTL mapping information.

Table 3-1. An example of parameter settings for simulation mapping information.

Population   Sample Size   Trait Mean   Map Function   Chromosomes   Marker genotype
                                                                     Mm      MM
B1           200           15.8         Haldane        9             1       0

The second step is to set the QTL parameters, such as the heritability ($h^2$), the ratio $C$ of epistatic variance to additive variance, defined as $V_I / V_A$ (see formulas 3-2A and 3-3A), and the QTL number, positions, and effects. One example of such a parameter setting is shown in Table 3-2. Using this information, it is easy to produce the additive ($\alpha$) and epistatic ($\beta$) upper-triangular matrix shown in Table 3-3. The QTL effects can then be adjusted according to $h^2$, $C$, and $V_e$ as follows.

Table 3-2. An example of parameter settings for QTL information.

QTL Number   Heritability   C = VI / VA   Additive Effect           Epistatic Effect
8            0.6            0.1           ¹Sign: Both (1:3)         Sign: Same
                                          ²Distribution: γ-2.1      Distribution: γ-0.3

¹Effects can be in the same direction or in both directions, in which case a ratio can be indicated. ²Effects can be chosen from different distributions, such as gamma (with one parameter), normal or even.

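Assume the heritability is $h^2$ and $C = V_I / V_A$; then the residual variance is

$$V_e = \frac{1 - h^2}{h^2} V_G$$

where $V_G = V_A + V_I$ is the genetic variance.

Note: formulas (3-2A), (3-2B) and (3-3A), (3-3B) can be used to calculate $V_I$ and $V_A$. After setting the values of $\alpha_i$ and $\beta_{ij}$, the $\beta_{ij}$ values should be adjusted according to the value of $C$.

If $R = V_I / (C V_A) \neq 1$, then $\beta_{ij}$ is replaced by $\beta_{ij} / \sqrt{R}$ (the epistatic variance is quadratic in the $\beta_{ij}$) to ensure that $R = 1$ and $C = V_I / V_A$.

Finally, the QTL effects are standardized by replacing $\alpha$ and $\beta$ with $\alpha / \sqrt{V_e}$ and $\beta / \sqrt{V_e}$, so that the value of $V_e$ is equal to 1.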

Table 3-3. An example of simulation parameter settings for the positions and effects of QTLs. Here, VA = 1.364, VI = 0.136, Ve = 1.0, C = VI / VA = 0.10, h2 = 0.60. Diagonal entries are additive effects (α) and off-diagonal entries are epistatic effects (β).

Chromosome       1        1        3        3        7        7        7        9
Positions (cM)   11.7     31.8     9.1      43.1     11.8     40.2     65.9     21.9

QTLs             1        2        3        4        5        6        7        8
1                0.958    1.364    0.102    0.000    0.000    0.173    0.000    0.000
2                         -0.381   0.183    0.000    0.000    0.000    0.747    0.000
3                                  0.559    0.000    0.000    0.000    0.000    0.083
4                                           -0.024   0.335    0.000    0.070    0.240
5                                                    0.929    0.000    0.195    0.112
6                                                             0.482    0.000    0.182
7                                                                      1.098    0.159
8                                                                               0.668
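As a check on the parameter setting, the additive variance of formula (3-2A) can be recomputed from the positions and additive effects in Table 3-3. The sketch below assumes that the Haldane map function is used (as in Table 3-1), that QTLs on different chromosomes are unlinked (r = 0.5), and that the diagonal of Table 3-3 holds the additive effects; the function names are chosen for this example only.

```python
import math

def haldane_r(d_cm):
    """Haldane map function: recombination frequency for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def additive_variance_b1(chrom, pos_cm, alpha):
    """Additive variance of a B1 population, formula (3-2A)."""
    va = 0.25 * sum(a * a for a in alpha)
    for i in range(len(alpha)):
        for j in range(i + 1, len(alpha)):
            # QTLs on different chromosomes are treated as unlinked (r = 0.5)
            r = haldane_r(abs(pos_cm[j] - pos_cm[i])) if chrom[i] == chrom[j] else 0.5
            va += 0.5 * alpha[i] * alpha[j] * (1.0 - 2.0 * r)
    return va

def residual_variance(h2, vg):
    """V_e = (1 - h^2) / h^2 * V_G."""
    return (1.0 - h2) / h2 * vg

chrom = [1, 1, 3, 3, 7, 7, 7, 9]
pos_cm = [11.7, 31.8, 9.1, 43.1, 11.8, 40.2, 65.9, 21.9]
alpha = [0.958, -0.381, 0.559, -0.024, 0.929, 0.482, 1.098, 0.668]

print(additive_variance_b1(chrom, pos_cm, alpha))   # close to the VA = 1.364 of Table 3-3
print(residual_variance(0.60, 1.364 + 0.136))       # close to Ve = 1.0
```

The small difference from 1.364 comes from the rounding of the effects printed in the table.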

3. Simulation Procedure

The marker genotype data and the trait value for each individual can be produced according to the mapping information and the QTL information. The basic simulation strategy is to walk along the chromosomes and treat marker positions and QTL positions alike. The difference is that when a marker is reached, only the marker genotype (0 or 1) is recorded, whereas when a QTL is reached, its effect is added into the trait value of the current individual.

For each individual, the simulation starts from the first marker of each chromosome. The first marker genotype is set to 0 or 1 with a 50% chance each and is recorded. At the next marker or QTL position, the chance of obtaining a given genotype depends on the recombination frequency between the previous position and the current position. For example, if the distance between these two positions is 10 cM and the Haldane map function is used, then according to formula (2-1) the recombination frequency is 0.091. Therefore, the current genotype differs from the previous one with a probability of only 9.1%. After the genotype for the current position has been decided, we either record the marker genotype or add the QTL additive effect into the trait value. The procedure continues until all markers and QTLs have been reached. Then the QTL epistatic effects are added. After adding the trait mean and the random residual effect, the trait value for the current individual is obtained.
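The walk along one chromosome can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program used in this study: the Haldane map function is taken in its standard form, the positions (markers and QTLs alike) are given in increasing order, and the function names are chosen for this example.

```python
import math
import random

def haldane_r(d_cm):
    """Haldane map function: recombination frequency for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def walk_chromosome(positions_cm, rng):
    """Simulate 0/1 genotypes at sorted positions for one individual."""
    genotypes = [rng.randint(0, 1)]  # first position: 0 or 1 with 50% chance
    for prev, cur in zip(positions_cm, positions_cm[1:]):
        # the genotype switches with probability equal to the recombination frequency
        recombined = rng.random() < haldane_r(cur - prev)
        genotypes.append(1 - genotypes[-1] if recombined else genotypes[-1])
    return genotypes

rng = random.Random(1)
print(round(haldane_r(10.0), 3))  # 0.091
print(walk_chromosome([0.0, 12.7, 23.7, 28.6, 41.7], rng))
```

For a B1 population, a simulated QTL genotype g (0 or 1) would then be recoded as x = g - 0.5 before its effect is added to the trait value according to model (3-1).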

4. Format of the Simulation Data

Table 3-4 shows an example of QTL mapping simulation data. The first part of the data is the marker genotype, recorded at every marker position of the whole genome. For inbred line crosses, there are 3 possible marker genotypes for an intercross population and 2 for backcross, DH, and RIL populations. Different numbers are usually used to represent the different marker genotypes; for example, 2 may denote genotype AA, 1 may denote Aa, and 0 may denote aa. The second part is the trait value, which is the joint effect of several factors. These factors include the trait mean, the heritability, and the QTL positions and effects (additive effects, dominance effects for an intercross population, and possible epistatic effects). In order to analyse the simulation data, other information besides the marker data and trait values is needed as well. This includes map information such as the map function, the marker positions, and the population type.

Table 3-4. An example of the simulation data with 5 individuals.

Individuals   Marker Data                                  Trait Value
1             111011111111111111111110000000000000000001   13.7218
2             11111111000000100000000000000111111111111    16.1372
3             11111110000000000000000000001111100000000    15.5285
4             11111100000001111000000011110000000000001    13.5461
5             00111100000000000000011111111000000000000    16.4589


3-2 Single Marker Analysis

In this section, a simulation study is conducted for single marker analysis. The simulation design is based on 500 replications, a sample size of 200, a B1 population with a trait mean of 15.6, the Haldane mapping function, and 3 chromosomes with 12, 11, and 15 markers respectively. The average marker distance is 10 cM, with some deviation in the marker positions (see Table 3-5). A total of 5 QTLs are set for the whole genome, and the heritability is 0.6. One QTL is set on chromosome 1 and one on chromosome 2. The other 3 QTLs, which are linked, are set on chromosome 3.

In Table 3-5, the t statistic (t-Val) for each marker is calculated by formula (2-3) and the LR value is obtained according to formula (2-4). The analysis also fits the data to the simple linear regression model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

and the estimates of $\beta_0$ and $\beta_1$ for each marker are reported. The t statistic tests the hypothesis that the marker is unlinked to the quantitative trait. The column headed Pr(t-Val) gives the corresponding p-value, that is, the probability of obtaining a t statistic at least this extreme if the marker is unlinked to the trait. Significance at the 5%, 1%, 0.1% and 0.01% levels is indicated by *, **, *** and ****, respectively.
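The quantities reported for each marker can be reproduced with an ordinary least-squares fit, as in the sketch below. Formulas (2-3) and (2-4) are not repeated in this section, so the standard forms are assumed here: the usual t statistic for the regression slope, and the normal-theory likelihood ratio LR = n ln(RSS0 / RSS1), which equals n ln(1 + t²/(n - 2)). The function name is chosen for this example.

```python
import math

def single_marker_test(x, y):
    """Regress trait y on marker code x; return (b0, b1, t, LR)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    t = b1 / se_b1
    lr = n * math.log(1.0 + t * t / (n - 2))  # identical to n * ln(RSS0 / RSS1)
    return b0, b1, t, lr

# Toy data (assumed): marker genotypes coded 0/1 and four trait values.
print(single_marker_test([0, 0, 1, 1], [1.0, 2.0, 3.0, 5.0]))
```

With n = 200, this relation maps t = 3.51 to an LR of about 12.1, in line with the averaged values of Table 3-5 (3.51 and 11.98 for marker 5 on chromosome 1).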

For the QTL with median effect (0.754) on chromosome 1, the QTL position is located with reasonable accuracy, as indicated by the significance levels. However, for the QTL with large effect (-1.331) on chromosome 2, the range of the estimated QTL position is much wider (marker 6 to marker 10). In the situation of multiple linked QTLs on chromosome 3, there is no way to distinguish the three QTLs because almost all markers have very high significance levels. None of the QTL effects can be estimated by the single marker method, because the QTL positions and effects are confounded. This simulation study shows that the single marker method has the power to detect markers associated with existing QTLs, although it can neither separate linked QTLs nor estimate their effects.


Table 3-5. Simulation result of single marker analysis (average of 500 replications).

¹Chr  ²QPos   ³Effect  ⁴Mk  ⁵MPos   t-Val   LR      β0      β1      Pr(t-Val)
1                      1    0.0     1.79    3.17    14.84   0.33    0.07566
                       2    12.7    2.14    4.53    14.79   0.42    0.03375 *
                       3    23.7    2.56    6.46    14.75   0.52    0.01115 *
                       4    28.6    2.80    7.70    14.72   0.57    0.00559 **
      43.20   0.754    5    41.7    3.51    11.98   14.64   0.73    0.00055 ***
                       6    49.9    3.19    9.95    14.67   0.66    0.00164 **
                       7    59.9    2.65    6.93    14.73   0.54    0.00861 **
                       8    68.0    2.30    5.24    14.77   0.46    0.02239 *
                       9    74.7    2.09    4.31    14.80   0.41    0.03823 *
                       10   90.8    1.68    2.81    14.85   0.30    0.09424
                       11   98.1    1.49    2.21    14.88   0.25    0.13774
                       12   108.2   1.33    1.78    14.91   0.20    0.18347

2                      1    0.0     1.68    2.83    15.16   -0.31   0.09370
                       2    5.5     1.83    3.35    15.18   -0.34   0.06804
                       3    18.0    2.24    4.98    15.23   -0.45   0.02596 *
                       4    27.9    2.72    7.27    15.28   -0.55   0.00709 **
                       5    34.8    3.06    9.16    15.32   -0.63   0.00251 **
                       6    52.2    4.31    17.74   15.45   -0.89   0.00003 ****
                       7    63.0    5.37    26.91   15.55   -1.10   0.00000 ****
      71.60   -1.331   8    73.5    6.27    35.86   15.63   -1.26   0.00000 ****
                       9    79.7    5.44    27.58   15.56   -1.11   0.00000 ****
                       10   93.2    4.05    15.76   15.42   -0.84   0.00007 ****
                       11   97.5    3.71    13.28   15.39   -0.77   0.00027 ***

3                      1    0.0     5.79    30.96   14.42   1.17    0.00000 ****
                       2    7.2     6.87    42.44   14.32   1.37    0.00000 ****
      17.50   1.217    3    18.6    8.80    65.63   14.17   1.66    0.00000 ****
                       4    30.4    7.97    55.20   14.24   1.54    0.00000 ****
                       5    35.0    7.81    53.27   14.25   1.51    0.00000 ****
      46.30   0.484    6    50.3    7.31    47.38   14.29   1.43    0.00000 ****
                       7    58.9    6.70    40.48   14.34   1.33    0.00000 ****
                       8    79.0    6.14    34.53   14.39   1.24    0.00000 ****
      84.40   0.711    9    80.0    6.14    34.57   14.39   1.24    0.00000 ****
                       10   97.9    4.62    20.29   14.53   0.95    0.00001 ****
                       11   102.8   4.18    16.74   14.57   0.86    0.00004 ****
                       12   112.0   3.50    11.85   14.64   0.72    0.00058 ***
                       13   119.4   3.02    8.94    14.70   0.62    0.00284 **
                       14   138.0   2.19    4.72    14.80   0.42    0.03001 *
                       15   139.7   2.13    4.49    14.80   0.40    0.03444 *

¹Chromosome. ²QTL position in cM. ³QTL effect. ⁴Marker number. ⁵Marker position in cM.


3-3 Comparing Different Mapping Methods

It is helpful to know the advantages and disadvantages of the QTL mapping methods before choosing one for a particular QTL mapping experiment. In this section, using DH as the model population, simulation studies are conducted to compare the performance of three methods, IM, CIM, and MCIM, under the simple additive model. Results are presented for the estimated QTL positions and effects, the detection power, and the probability of detecting false QTLs.

1. Parameters Setting

In this study, the simulation design is based on 500 replications, a sample size of 200, a population mean of 15.6, the Haldane mapping function, 9 chromosomes with 11 markers each, and an average marker distance of 10 cM with some deviation in the marker positions (Figure 3-1). A total of 7 QTLs are set for the whole genome, and the heritability is 0.6. Among these QTLs, 2 have large effects, 3 have median effects, and 2 have small effects. One QTL with median effect and one QTL with small effect have effects of opposite sign to the other QTLs. According to the number of QTLs per chromosome, two different QTL models are constructed: Model-I has only one QTL per chromosome, and Model-II has multiple QTLs per chromosome.

2. Estimation of QTL Effects

The estimates of QTL effects for the one-QTL Model-I and the multiple-QTL Model-II obtained with the IM, CIM, and MCIM methods are shown in Table 3-6. The estimates were obtained by averaging the estimated effects at each known QTL position over the 500 replications. For Model-I, the estimated QTL effects are very close to the parameter values. These results imply that the estimation of QTL effects is unbiased for all three QTL mapping methods (IM, CIM, and MCIM) under the one-QTL model. Unlike Model-I, the estimation of QTL effects shows some bias under the multiple-QTL Model-II, due to the linkage between QTLs. For the two QTLs with large effects (Q2-1L and Q2-2L), the effects are clearly overestimated by the IM method. For the two QTLs with small effects (Q3-1S and Q3-2S), the effects are underestimated by all three mapping methods. The situation is mixed for the three median-effect QTLs (Q1-1M, Q1-2M, and Q1-3M), with some QTL effects underestimated and some overestimated. For Q1-3M in particular, the bias is quite serious for the IM method. However, the estimation bias of QTL effects is quite small for QTLs with large and median effects, especially when the CIM and MCIM methods are used.

Table 3-6. The simulation results of the QTL effects at the QTL positions for Model-I and Model-II.

MODEL-I                                      MODEL-II
¹QTLs   ²Eff    IM      CIM     MCIM         QTLs    Eff     IM      CIM     MCIM
Q1-1L   1.50    1.51    1.51    1.54         Q1-1M   0.64    0.92    0.78    0.89
Q2-1M   0.69    0.70    0.69    0.72         Q1-2M   0.64    0.84    0.52    0.76
Q3-1S   0.17    0.17    0.18    0.18         Q1-3M   -0.64   -0.17   -0.56   -0.43
Q4-1L   1.50    1.48    1.48    1.53         Q2-1L   1.39    1.71    1.40    1.46
Q5-1M   -0.69   -0.69   -0.69   -0.69        Q2-2L   1.39    1.71    1.38    1.48
Q7-1M   0.69    0.70    0.70    0.71         Q3-1S   -0.16   -0.06   -0.08   -0.06
Q9-1S   -0.17   -0.16   -0.17   -0.17        Q3-2S   0.16    0.09    0.09    0.10

¹QTL = QTL with chromosome number and serial number, followed by effect size (L-large, M-median or S-small). ²Eff = effect of QTLs.

3. Power and False Positive

Simulation results are presented for the power of QTL detection and the probability of identifying false QTLs under different thresholds, for the IM, CIM, and MCIM methods under Model-I and Model-II (Table 3-7 and Table 3-8). This information was obtained by analysing the LR peaks in the LR profile of each chromosome. A detected QTL is defined as a valid LR peak whose highest LR value is greater than a predefined threshold. If a detected QTL matches a predefined QTL, it is counted when calculating the power of QTL detection. If the detected QTL cannot be matched with any predefined QTL on the same chromosome, it is counted as a false QTL. The predefined threshold value is clearly very important for mapping QTL. Decreasing the threshold value increases both the power of QTL detection and the probability of detecting false QTLs; increasing it has the reverse effect.
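The counting rule described above can be sketched as follows. The matching criterion is an assumption made for this illustration: a peak is matched to the nearest predefined QTL on the same chromosome within a hypothetical window of 10 cM, which is not necessarily the rule used to produce Tables 3-7 and 3-8. The conversion LR = 2 ln(10) × LOD is the standard relation between the two scales.

```python
import math

def lod_to_lr(lod):
    """Convert a LOD threshold to the likelihood-ratio scale: LR = 2 ln(10) * LOD."""
    return 2.0 * math.log(10.0) * lod

def classify_peaks(peaks, true_qtls, lr_threshold, window_cm=10.0):
    """Split significant LR peaks into detected true QTLs and false QTLs.

    peaks     : list of (chromosome, position_cM, LR) for one replication
    true_qtls : list of (chromosome, position_cM) of the predefined QTLs
    window_cm : hypothetical matching window (an assumption of this sketch)
    Returns (set of indices of detected true QTLs, number of false QTLs).
    """
    detected = set()
    n_false = 0
    for chrom, pos, lr in peaks:
        if lr <= lr_threshold:
            continue  # peak below the threshold: not a detected QTL
        candidates = [
            (abs(pos - q_pos), k)
            for k, (q_chrom, q_pos) in enumerate(true_qtls)
            if q_chrom == chrom and abs(pos - q_pos) <= window_cm
        ]
        if candidates:
            detected.add(min(candidates)[1])  # nearest predefined QTL
        else:
            n_false += 1
    return detected, n_false

peaks = [(1, 12.0, 20.0), (1, 60.0, 15.0), (2, 30.0, 5.0)]
print(classify_peaks(peaks, [(1, 12.1), (2, 30.7)], lod_to_lr(2.5)))  # ({0}, 1)
```

Averaging the detection indicator of each QTL and the false-QTL counts over all replications gives the power and false-QTL probabilities of the kind reported in the tables.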


Table 3-7. Power of QTL detection and the probability of false QTL detected under different thresholds for Model-I.

          ¹LOD = 2.0              LOD = 2.5               LOD = 3.0
QTL       IM      CIM     MCIM    IM      CIM     MCIM    IM      CIM     MCIM
Q1-1L     100     100     100     100     100     100     100     100     100
Q2-1M     53.4    84.8    78.0    41.8    79.8    70.6    29.8    70.6    63.0
Q3-1S     1.0     2.6     3.6     0.6     0.6     1.8     0.4     0.4     0.8
Q4-1L     100     100     100     100     100     100     99.8    100     100
Q5-1M     53.6    90.8    80.2    37.6    81.6    70.8    24.0    71.0    60.4
Q7-1M     54.2    88.0    78.0    37.4    81.6    69.6    28.2    73.4    61.6
Q9-1S     2.8     2.8     5.0     0.6     2.2     2.2     0.4     1.0     0.8
²FQTL1    34.4    29.2    29.2    27.2    22.2    22.2    21.6    16.6    18.4
³FQTL2    9.8     5.8     12.0    5.6     2.4     7.8     3.0     0.4     4.0
⁴FQTL2+   2.6     0.8     4.4     0.8     0.6     0.8     0.6     0.4     0.6

¹LOD = threshold. ²FQTL1 = probability of detecting one false QTL in the whole genome. ³FQTL2 = probability of detecting two false QTLs. ⁴FQTL2+ = probability of detecting more than two false QTLs.

For Model-I (Table 3-7), the three mapping methods were equally efficient in detecting QTLs with large effects (Q1-1L and Q4-1L). However, QTLs with very small effects (Q3-1S and Q9-1S) could only be detected with very low efficiency. The power of detecting QTLs with median effects is affected by the choice of QTL mapping method and by the threshold value. The CIM method tended to have the highest power among the three mapping methods, and the MCIM method was more efficient than the IM method. As for the probability of false QTL detection, the IM method in general gave more false QTLs under the three threshold values. The CIM and MCIM methods had a similar likelihood of finding one false QTL, but the CIM method was better than the MCIM method when the detection of two or more false QTLs is considered.

For the multiple-QTL model (Model-II in Table 3-8), all three mapping methods have high efficiency for detecting QTLs with large effects on the same chromosome (Q2-1L and Q2-2L). QTLs with very small effects on one chromosome (Q3-1S and Q3-2S) are almost undetectable by these three methods. The IM method cannot detect the QTL with negative median effect (Q1-3M), which is linked to QTLs with positive effects (Q1-1M and Q1-2M). The CIM method tended to be more efficient than the MCIM method, except for one QTL (Q1-2M) that is closely linked to another (Q1-1M) with the same direction of effect. The IM method gave a high probability of false QTL detection compared with the other two mapping methods, and the CIM method tended to have the smallest likelihood of finding false QTLs.

Table 3-8. Power of QTL detection and the probability of false QTL detected under different thresholds for Model-II.

          LOD = 2.0               LOD = 2.5               LOD = 3.0
QTL       IM      CIM     MCIM    IM      CIM     MCIM    IM      CIM     MCIM
Q1-1M     81.2    100     84.4    75.8    100     83.0    67.6    99.6    80.4
Q1-2M     14.4    18.8    22.8    13.4    14.6    21.6    11.2    10.0    20.2
Q1-3M     0.8     65.4    38.6    0.2     55.8    30.2    0.0     43.8    25.2
Q2-1L     99.0    99.8    100     99.0    99.8    100     99.0    99.8    99.6
Q2-2L     99.4    100     99.8    99.4    100     99.8    99.4    100     99.8
Q3-1S     0.4     0.6     2.2     0.0     0.0     0.6     0.0     0.0     0.0
Q3-2S     0.8     1.8     1.4     0.4     0.4     0.8     0.0     0.4     0.4
FQTL1     42.6    18.8    31.8    45.6    11.2    22.4    45.2    7.2     17.0
FQTL2     24.4    3.0     12.0    21.6    1.0     7.8     20.6    0.8     5.6
FQTL2+    7.8     0.0     3.4     6.2     0.0     1.0     6.0     0.0     0.4

Table 3-9 suggests that the density of genetic markers affects both the power of QTL detection and the probability of detecting false QTLs. When marker density increases, there is no apparent gain in power for detecting QTLs with large effects (Q2-1L and Q2-2L) with any of the three QTL mapping methods. However, the MCIM method tends to be more powerful than the other two methods (IM and CIM) for detecting QTLs with small effects. For the power of detecting the linked QTLs Q1-2M and Q1-3M, which have opposite effects, the MCIM method shows a great improvement, while the CIM method performs quite poorly. This may suggest that increasing marker density is sometimes even harmful for the CIM method. The QTL Q1-3M still cannot be detected by the IM method as marker density increases.

The impact of sample size on the power of QTL detection and the probability of detecting false QTLs is shown in Table 3-10. In general, the power of QTL detection increases as the sample size increases for all three mapping methods. In particular, the CIM method shows a large improvement in both the power of QTL detection and the probability of detecting false QTLs when the sample size is increased to 300.


Table 3-9. Power of QTL detection and the probability of false QTL detected under Model-II when chromosomes = 3, average marker distance = 4 cM, and threshold value is LOD = 2.5.

QTL       IM      CIM     MCIM
Q1-1M     72.8    99.8    76.8
Q1-2M     27.8    13.6    47.0
Q1-3M     1.6     41.8    44.2
Q2-1L     98.2    100     98.8
Q2-2L     99.6    100     99.8
Q3-1S     0.4     0.8     4.4
Q3-2S     0.0     0.4     3.0
¹FQTL     81.0    49.6    48.4

¹Probability of false QTL detected in whole genome.

Table 3-10. Power of QTL detection and the probability of false QTL detected under Model-II for different sample sizes (threshold value with LOD = 2.5).

          Samples = 100           Samples = 300
QTL       IM      CIM     MCIM    IM      CIM     MCIM
Q1-1M     37.6    66.8    59.0    86.1    98.9    86.3
Q1-2M     10.2    21.4    18.4    16.1    44.9    29.3
Q1-3M     0.4     21.8    11.2    2.2     76.9    48.5
Q2-1L     96.0    98.8    97.2    100     100     100
Q2-2L     91.2    98.6    95.0    100     100     100
Q3-1S     0.0     0.6     1.2     0.7     0.9     1.3
Q3-2S     0.0     0.6     0.6     0.0     1.1     1.1
FQTL      42.0    50.2    33.4    87.0    15.9    29.1

The performance of a QTL mapping analysis is also affected by the tuning parameters of the method itself. Before the analysis, the CIM method requires parameters such as the “window size” and the “control marker number” to be set. In this simulation study, we simply use the default parameters, that is, 10 cM for the window size and 5 for the control marker number. However, changing these parameters in the CIM method can have a great influence on the power of QTL detection and the probability of false QTL detection, as shown in Table 3-11. On the other hand, because the MCIM method treats the background control markers as random effects, the influence of the control markers is much smaller than for the CIM method.


Table 3-11. Power of QTL detection and the probability of false QTL detected under different numbers of background control markers in Model-II (threshold value with LOD = 2.5).

          CIM                               MCIM
QTL       ¹Mn = 5   Mn = 10   Mn = 25       Mn = 5    Mn = 10   Mn = 25
Q1-1M     100       100       98.6          83.0      78.2      78.4
Q1-2M     14.6      25.9      26.2          21.6      28.2      28.4
Q1-3M     55.8      67.0      63.8          30.2      43.2      51.6
Q2-1L     99.8      100       100           100       100       100
Q2-2L     100       100       99.2          99.8      99.6      99.8
Q3-1S     0.0       0.9       1.0           0.6       1.8       2.8
Q3-2S     0.4       0.0       1.8           0.8       2.4       1.8
FQTL      12.2      31.2      65.6          31.2      34.8      39.4

¹Mn = number of control markers.

4. Positions and Effects of Detected QTLs

A summary of the position estimates and the 95% experimental confidence intervals (ECI) for detected QTLs is presented in Table 3-12 for Model-I and Model-II, with the threshold set to LOD = 2.5. For the two QTLs with large effects, the estimated position is quite accurate, with a small ECI, for all three mapping methods. The average range of the ECI is 14 cM, 8.3 cM, and 9.5 cM for the IM, CIM, and MCIM methods, respectively. Unlike the CIM and MCIM methods, the average range of the ECI increases considerably (from 11 cM to 17 cM) from Model-I to Model-II for the IM method. When the QTL has a median effect, the estimate of the QTL position becomes less accurate and the ECI becomes larger. For example, the average range of the ECI is almost doubled for the median-effect QTLs in Model-I with the CIM and MCIM methods (15 cM for CIM and 20 cM for MCIM). For the two small-effect QTLs, it is difficult to obtain a good estimate of the QTL position and a reasonable ECI, because this kind of QTL is detected only a few times in 500 replications owing to the extremely low power of QTL detection.

For the single-QTL Model-I, the estimated effects of the detected QTLs are unbiased for the two large QTLs (Q1-1L and Q4-1L), as shown in Table 3-13. However, the effects tend to be overestimated for the QTLs with median and small effects. The reason is that the detection power for this kind of QTL is much less than 100%. That is, in each replication only an LR peak greater than the predefined threshold value is picked as an identified QTL, and a large LR peak tends to come with a larger estimate of the QTL effect than a small LR peak. Therefore, in a real QTL mapping situation, a QTL identified with a median or small effect is likely to have an overestimated effect. The overestimation of QTL effects can be larger for two linked QTLs, such as Q2-1L and Q2-2L in Model-II. This may imply that QTL linkage affects the estimation of QTL effects. Comparing the three QTL mapping methods, the CIM method performs well for the estimation of QTL effects and the ECI for QTLs with median effects, partly because of its high power of QTL detection for this kind of QTL.
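The selection effect described above can be illustrated with a small simulation. This is a deliberately simplified sketch, not the analysis used in this study: the effect estimate of each replication is drawn from a normal distribution around the true effect, and a replication counts as a detection when the standardized estimate exceeds a cut-off. The standard error (0.20) and the cut-off (3.0) are hypothetical values chosen for the example.

```python
import random

def detected_effect_summary(true_effect, std_error, z_threshold, n_rep, rng):
    """Return (power, mean effect estimate among the detected replications)."""
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_rep)]
    # keep only replications whose standardized estimate passes the cut-off
    detected = [e for e in estimates if abs(e) / std_error > z_threshold]
    power = len(detected) / n_rep
    mean_detected = sum(detected) / len(detected) if detected else float("nan")
    return power, mean_detected

rng = random.Random(2000)
power, mean_detected = detected_effect_summary(0.17, 0.20, 3.0, 20000, rng)
print(f"power = {power:.3f}, mean detected estimate = {mean_detected:.2f}")
```

Because only the replications with unusually large estimates pass the cut-off, the mean among detected replications lies well above the true value of 0.17. The same mechanism explains the inflated estimates for the small-effect QTLs in Table 3-13.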

Table 3-12. The simulation results of the positions of the detected QTLs under Model-I and Model-II when the threshold value is set to LOD = 2.5.

                            IM                     CIM                    MCIM
Genome     QTL     ¹Pos     ²Est    ³ECI           Est     ECI            Est     ECI
Model-I    Q1-1L   12.1     12.2    (8 to 17)      12.4    (8 to 17)      12.4    (8 to 16)
           Q2-1M   30.7     30.5    (18 to 45)     27.0    (20 to 36)     29.6    (18 to 42)
           Q3-1S   75.5     81.6    (69 to 89)     81.6    (77 to 89)     79.1    (62 to 90)
           Q4-1L   82.5     82.5    (76 to 89)     82.5    (78 to 87)     82.2    (76 to 88)
           Q5-1M   98.8     97.4    (85 to 102)    97.5    (87 to 102)    96.7    (84 to 102)
           Q7-1M   8.8      7.7     (0 to 21)      9.2     (5 to 19)      8.1     (0 to 18)
           Q9-1S   50.6     41.8    (41 to 45)     53.1    (39 to 66)     46.6    (36 to 56)

Model-II   Q1-1M   12.1     14.9    (4 to 26)      16.8    (8 to 23)      16.2    (4 to 26)
           Q1-2M   30.7     29.6    (28 to 37)     31.1    (26 to 41)     30.5    (28 to 42)
           Q1-3M   75.5     74.7    (75 to 75)     75.4    (66 to 83)     75.1    (64 to 84)
           Q2-1L   8.8      10.1    (4 to 20)      8.8     (6 to 12)      9.1     (4 to 12)
           Q2-2L   82.5     79.5    (68 to 86)     82.5    (77 to 86)     81.9    (76 to 86)
           Q3-1S   50.6                                                   46.0    (44 to 48)
           Q3-2S   98.8     85.9    (85 to 87)     106.0   (106 to 106)   101.0   (90 to 106)

¹Pos = position of QTLs. ²Est = estimated QTL position. ³ECI = 95% experimental confidence interval for QTL position.

Note: blank table cells are caused by zero detection power.


Table 3-13. The summary of the effects of the detected QTLs under Model-I and Model-II when the threshold value is set to LOD = 2.5.

                            IM                       CIM                      MCIM
Genome     QTL     ¹Eff     ²Est    ³ECI             Est     ECI              Est     ECI
Model-I    Q1-1L   1.50     1.54    (1.1 to 1.9)     1.53    (1.2 to 1.9)     1.59    (1.2 to 2.1)
           Q2-1M   0.69     0.94    (0.8 to 1.3)     0.77    (0.5 to 1.1)     0.86    (0.6 to 1.4)
           Q3-1S   0.17     0.85    (0.7 to 1.0)     0.59    (0.6 to 0.6)     0.67    (0.5 to 0.9)
           Q4-1L   1.50     1.50    (1.1 to 1.9)     1.49    (1.1 to 1.8)     1.54    (1.1 to 2.0)
           Q5-1M   -0.69    -0.92   (-1.2 to -0.7)   -0.75   (-1.0 to -0.6)   -0.79   (-1.2 to -0.6)
           Q7-1M   0.69     0.95    (0.7 to 1.3)     0.76    (0.6 to 1.0)     0.83    (0.6 to 1.2)
           Q9-1S   -0.17    -0.89   (-0.9 to -0.9)   -0.59   (-0.7 to -0.5)   -0.73   (-1.0 to -0.6)

Model-II   Q1-1M   0.64     1.05    (0.8 to 1.4)     1.08    (0.7 to 1.4)     1.09    (0.7 to 1.5)
           Q1-2M   0.64     1.00    (0.8 to 1.3)     0.93    (0.7 to 1.2)     1.07    (0.6 to 1.6)
           Q1-3M   -0.64    -0.81   (-0.8 to -0.8)   -0.71   (-1.0 to -0.6)   -0.80   (-1.5 to -0.5)
           Q2-1L   1.39     1.75    (1.4 to 2.1)     1.42    (1.1 to 1.7)     1.53    (1.1 to 2.0)
           Q2-2L   1.39     1.75    (1.4 to 2.1)     1.40    (1.1 to 1.7)     1.53    (1.1 to 2.0)
           Q3-1S   -0.16                                                      -0.73   (-1.0 to -0.6)
           Q3-2S   0.16     0.80    (0.8 to 0.8)     0.55    (0.5 to 0.6)     0.71    (0.6 to 0.8)

¹Eff = effect of QTLs. ²Est = estimated QTL effect. ³ECI = 95% experimental confidence interval for QTL effect.

5. The LR Profile

The average mapping results of the two QTL models for the three QTL mapping methods (IM, CIM, and MCIM) are shown in Figure 3-1. For Model-I, all three mapping methods performed quite well: the estimates of QTL positions and effects are unbiased, and the LR values depend on the QTL effects. For Model-II, the two QTLs with small effects (Q3-1S and Q3-2S) are undetectable by all three methods. All three methods have very high power to detect the QTLs with large effects (Q2-1L and Q2-2L). However, compared with the CIM and MCIM methods, the IM method shows more noise between these two large-effect QTLs, and this kind of noise could be harmful if such LR peaks were taken as QTLs. For the three QTLs with median effects on chromosome 1, the highest LR value is obtained by the CIM method for Q1-1M and Q1-3M, but by the MCIM method for Q1-2M. The IM method has a very low LR value for Q1-3M, possibly because it has no detection power for linked QTLs with opposite effects. For Q1-2M, the CIM method has a very low LR value because of the closeness of the first two QTLs.

[Figure 3-1: QTL labels shown are Q1-1L, Q2-1M, Q3-1S, Q4-1L, Q5-1M, Q7-1M, Q9-1S for Model-I and Q1-1M, Q1-2M, Q1-3M, Q2-1L, Q2-2L, Q3-1S, Q3-2S for Model-II.]

Figure 3-1. The average QTL mapping LR profiles and additive effect profiles over the 500 simulations for the two QTL models. The long vertical bars are chromosomes and the short vertical bars are QTL positions and effects. The small dots distributed along the horizontal bars are genetic markers. Only the chromosomes with QTLs (1, 2, 3) are shown for Model-II.

3-4 Considering Complicated QTL Mapping Situations

In Section 3-3, the performance of the IM, CIM, and MCIM QTL mapping methods under the simple additive situation was studied. However, in many real QTL mapping experiments, more complicated situations such as QTL by environment interaction and QTL epistasis generally exist. In this section, we conduct simulation studies for the IM and CIM methods under these complicated QTL mapping situations. The performance of the MCIM method is also studied, using the corresponding mixed linear models for QTL by environment interaction (Model 2-13) and QTL epistasis (Model 2-14).

1. Parameters Setting

To study QTL by environment (QE) interaction, the simulation under Model-AE is based on the following parameter settings: 300 replications, a DH population with 100 individuals, 3 environments with 2 repeats each, a population mean of 15.6, the Haldane mapping function, a genome of 3 chromosomes with 11 markers each, and an average marker distance of 10 cM with some deviation in the marker positions. Each chromosome has one QTL, and the heritability is 0.36. The QTL positions, QTL main effects, and QE interaction effects are shown in Table 3-14. Notice that among these three QTLs, Q1-1 has large QE interaction effects but no main effect. Q2-1 and Q3-1 have the same QTL main effect, but Q2-1 has QE interaction effects whereas Q3-1 has none.

Table 3-14. QTL parameter settings of Model-AE for the simulation study of QTL by environment interaction.

                                            QE Interaction Effects
QTLs    ¹Chr.   ²Pos.   Main Effect (Q)     ³QE1     QE2      QE3
Q1-1    1       22.0    0.00                -0.34    -0.38    0.72
Q2-1    2       54.7    0.62                0.28     0.39     -0.67
Q3-1    3       96.4    0.62                0.00     0.00     0.00

¹Chromosome. ²QTL position in cM. ³QE1, QE2 and QE3 are the QE interaction effects for environment 1, environment 2, and environment 3.


Table 3-15. QTL settings of Model-AA and Model-A for mapping QTLs with epistatic effects.

QTLs   Chr¹   Pos²    Additive³      QTL i   QTL j   Epistasis⁴
Q1-1   1      12.2     0.80          Q1-1    Q2-2     0.48
Q1-2   1      30.7     0.80          Q1-2    Q1-3     0.96
Q1-3   1      75.5    -0.80          Q2-1    Q2-2    -0.96
Q2-1   2       8.8     1.61          Q2-2    Q3-1     1.92
Q2-2   2      82.5     0.00
Q3-1   3      50.6     0.27
Q3-2   3      98.8    -0.27

¹Chromosome. ²Positions in cM. ³QTL additive effects. ⁴QTL epistatic effects.

Model-AA and Model-A are used to study QTL epistasis. The simulation is based on the following parameter settings: 300 replications; a DH or B1 population of 200 individuals; a population mean of 15.6; the Haldane mapping function; and a genome of 3 chromosomes, each with 11 markers at an average spacing of 10 cM, with the marker positions deviating somewhat from even spacing. The QTL settings for Model-AA are shown in Table 3-15. The QTL settings for Model-A are the same as for Model-AA except that there are no epistatic effects between QTLs. Therefore, Model-AA is an additive-epistatic QTL model and Model-A is a simple additive model, similar to Model-II in Section 3-3. The only difference between Model-AA and Model-A is the QTL epistatic effects.
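The genotypes for such a simulation can be generated chromosome by chromosome. The following is a minimal sketch, assuming no crossover interference (as implied by the Haldane mapping function) and a hypothetical +1/-1 coding of the two homozygous DH genotypes. The function names and the evenly spaced marker positions are illustrative only and are not taken from the simulation program used in this study.

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane mapping function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_dh_chromosome(positions_cm, n_ind, rng):
    """Simulate DH genotypes (+1 or -1, hypothetical coding) at ordered loci."""
    genotypes = []
    for _ in range(n_ind):
        g = [rng.choice((-1, 1))]          # first locus: either parental type
        for left, right in zip(positions_cm, positions_cm[1:]):
            r = haldane_r(right - left)
            # switch parental type with probability r between adjacent loci
            g.append(-g[-1] if rng.random() < r else g[-1])
        genotypes.append(g)
    return genotypes

# 11 markers at exactly 10 cM spacing (even spacing is an assumption for illustration)
markers = [10.0 * k for k in range(11)]
geno = simulate_dh_chromosome(markers, n_ind=200, rng=random.Random(1))
```

QTL positions can be included in `positions_cm` together with the markers, so that the QTL genotypes are generated with the correct linkage to their flanking markers.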

2. Performance of IM and CIM Methods

- QTL by Environment Interaction

The estimates for Model-AE in Table 3-16 were obtained by averaging the estimated effects at each known QTL position over the 300 replications. In this simulation study, the estimate of the main effect is unbiased for both the IM and CIM methods when the data from all environments are analysed together. If the QTL has no main effect, like Q1-1, the estimate of each QE interaction effect is also unbiased when the data from that particular environment are used. For a QTL without QE interaction effects (Q3-1), the effects estimated in the individual environments are very similar to the estimated main effect. However, for a QTL with both a main effect and QE interaction effects (Q2-1), the effect estimated from the data of a specific environment is biased. In fact, the QTL effect estimated in a specific environment is the sum of the QTL main effect and the QE interaction effect of that environment.
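The last point can be checked with simple arithmetic. The short sketch below adds the main effect and the QE interaction effects set in Table 3-14 (the QTL on chromosome 3 is labelled Q3-1 here, as in the text); the function name is illustrative.

```python
# Parameter values copied from Table 3-14: main effect and QE effects per environment
params = {
    "Q1-1": (0.00, (-0.34, -0.38, 0.72)),
    "Q2-1": (0.62, (0.28, 0.39, -0.67)),
    "Q3-1": (0.62, (0.00, 0.00, 0.00)),
}

def environment_effects(main, qe):
    """Effect expected from a single-environment analysis: main effect + QE effect."""
    return [round(main + e, 2) for e in qe]

expected = {q: environment_effects(m, qe) for q, (m, qe) in params.items()}
for q, eff in expected.items():
    print(q, eff)
```

For Q2-1 the sums are 0.90, 1.01, and -0.05, close to the single-environment estimates in Table 3-16 (0.90, 1.03 to 1.04, and -0.05).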

Table 3-16. Estimation of effects of Model-AE at the QTL positions for the IM and CIM methods under various environments (threshold LOD = 2.5).

        E1-3¹            E1²              E2               E3
QTLs    IM      CIM      IM      CIM      IM      CIM      IM      CIM
Q1-1    0.00    0.00    -0.35   -0.36    -0.40   -0.39     0.74    0.74
Q2-1    0.62    0.62     0.90    0.90     1.03    1.04    -0.05   -0.05
Q3-1    0.63    0.62     0.62    0.63     0.63    0.65     0.63    0.63

¹Using the data from all environments together. ²E1, E2, and E3 represent using only the data from environment 1, environment 2, and environment 3, respectively.

The power of QTL detection and the probability of detecting false QTLs when using the data of the various environments are shown in Table 3-17. For the first QTL (Q1-1), there is almost no detection power when the data from all environments are used together, because its main effect is 0. The power is quite low when using the data of environment 1 or environment 2 because the effects QE1 and QE2 are small. However, the power is quite high in environment 3 because the effect QE3 is relatively large (0.72). Q2-1 has both a main effect and QE interaction effects, and its detection power is quite high for the combined analysis and for environments 1 and 2. Interestingly, its detection power in environment 3 is almost 0. The reason is not that the QTL has no effect in environment 3, but that the main effect and the QE3 effect cancel each other out. For the last QTL (Q3-1), the difference in detection power between using the data from all environments together and using the data from only one environment is caused by the change in sample size, not by QE interaction effects.

In terms of QTL detection power, the overall performance of the IM and CIM methods is quite similar in this simulation. There are two possible reasons. First, the IM method performs quite well under the one-QTL model. Second, the small sample size (only 100 individuals, and the 2 replicates seem to be of little help) harms the CIM method more than the IM method. However, in terms of the probability of detecting false QTLs, the CIM method still performs much better than the IM method.


Table 3-17. The power of QTL detection and the probability of false QTL detected for the IM and CIM methods under various environments (threshold LOD = 2.5).

         E1-3             E1               E2               E3
QTLs     IM      CIM      IM      CIM      IM      CIM      IM      CIM
Q1-1      1.95    0.98    14.33   21.00    15.67   23.67    91.67   96.67
Q2-1     98.05  100       94.33   97.33    92.67   86.67     0.67    0.00
Q3-1     97.07   98.05    51.67   51.67    45.33   50.33    61.33   73.00

FQTL1    37.07   10.73    30.00   12.67    27.00   18.33    20.33    4.33
FQTL2    11.22    0.98     4.00    0.67     4.67    1.33     1.67    0.33
FQTL2+    1.95    0.00     0.00    0.00     0.33    0.00     0.67    0.00

Table 3-18. Estimation of positions and 95% ECI for the detected QTLs under various environments (threshold LOD = 2.5).

        E1-3                                   E1
QTLs    IM                 CIM                 IM                 CIM
Q1-1    25.4(12.7−32.6)    27.2(23.7−30.6)     21.5(8.0−28.6)     21.5(10.0−32.6)
Q2-1    55.1(41.9−64.2)    56.0(46.8−66.2)     55.0(39.9−66.2)    55.8(44.8−66.2)
Q3-1    95.4(84.9−104.0)   95.2(84.9−104.0)    95.9(82.9−106.0)   96.0(84.9−106.0)

        E2                                     E3
QTLs    IM                 CIM                 IM                 CIM
Q1-1    22.6(12.7−32.6)    22.8(12.7−34.6)     22.8(12.7−32.6)    23.0(14.7−34.6)
Q2-1    55.1(41.9−66.2)    55.8(46.8−64.2)     62.2(62.2−62.2)    --------
Q3-1    95.6(82.9−106.0)   90.3(82.9−104.0)    95.6(84.9−106.0)   96.6(90.0−106.0)

Note: bold numbers correspond to the positions with high power of QTL detection (greater than 90.0).

Table 3-19. Estimation of effects and 95% ECI for the detected QTLs under various environments (threshold LOD = 2.5).

        E1-3                                       E1
QTLs    IM                   CIM                   IM                   CIM
Q1-1    -0.41(-0.46−0.35)    -0.35(-0.35−-0.35)    -0.66(-0.96−0.55)    -0.61(-0.90−0.49)
Q2-1    0.62(0.43−0.85)      0.62(0.44−0.78)       0.91(0.62−1.26)      0.91(0.60−1.20)
Q3-1    0.63(0.42−0.83)      0.64(0.47−0.82)       0.77(0.57−0.99)      0.76(0.57−0.98)

        E2                                         E3
QTLs    IM                   CIM                   IM                   CIM
Q1-1    -0.75(-1.03−0.54)    -0.65(-0.96−0.51)     0.78(0.55−1.04)      0.77(0.54−1.05)
Q2-1    1.05(0.73−1.36)      1.07(0.74−1.42)       -0.60(-0.61−0.59)    --------
Q3-1    0.78(0.57−1.06)      0.73(0.50−1.03)       0.73(0.54−0.98)      0.70(0.52−0.95)

The estimated positions of the detected QTLs and their 95% ECI are shown in Table 3-18. The position estimates are quite accurate, especially for the QTLs with high detection power (bold numbers). The 95% ECI spans about 20 cM for most QTLs. Overall, the position estimates and their 95% ECI are quite similar for the IM and CIM methods. For the positions with high QTL detection power, the effect estimates for the detected QTLs (Table 3-19) are consistent with the effect estimates at the known QTL positions (Table 3-16). However, the effect estimates for the detected QTLs are not accurate at positions with low QTL detection power (see also Table 3-17).

- QTL Epistasis

The estimates of the QTL additive (main) effects in Table 3-20 were obtained by averaging the estimated effects at each known QTL position over the 300 replications. Compared with Model-A, Model-AA has additional epistatic effects. It is interesting that the average estimates of the QTL additive effects are nevertheless very similar for Model-AA and Model-A under both the IM and CIM methods. This implies that epistatic effects may not affect the average estimates of the QTL additive effects when a DH or B1 population is used. In other words, the additive effects can still be estimated without bias even when epistatic effects exist, provided that a DH or B1 population is used. On the other hand, the estimates can be biased by linkage between QTLs, as in the simple additive case of Model-II in Section 3-3. Moreover, the standard deviation of the estimated additive effects is larger in Model-AA than in Model-A. Therefore, as this simulation study indicates, QTL epistatic effects may inflate the variance of the estimates of QTL additive effects.

Table 3-21 summarizes the QTL detection power and the probability of detecting false QTLs for Model-AA and Model-A. For the CIM method, the detection power is generally lower in Model-AA than in the simple additive Model-A, especially for Q1-2, Q1-3, and Q3-2. In addition, the probability of detecting false QTLs is much higher in Model-AA (75.67 versus 9.67). These results imply that, if epistatic effects exist, the QTL mapping results of the CIM method, including the detection power and the false QTL rate, are seriously affected. On the other hand, the impact of epistatic effects on the IM method is not serious in this particular simulation. Comparing the two methods when epistatic effects are present, CIM has the higher QTL detection power and IM has the lower probability of detecting false QTLs, as demonstrated in this simulation study.


Table 3-20. Estimation of QTL additive effects for Model-AA and Model-A by using the IM and CIM methods at the known QTL positions.

        IM & MdAA¹       IM & MdA         CIM & MdAA       CIM & MdA
QTLs    Est²    SD³      Est     SD       Est     SD       Est     SD
Q1-1     1.15   0.209     1.15   0.183     0.94   0.242     0.94   0.165
Q1-2     1.05   0.224     1.05   0.190     0.80   0.283     0.80   0.205
Q1-3    -0.22   0.207    -0.22   0.182    -0.70   0.174    -0.76   0.133
Q2-1     1.62   0.201     1.62   0.161     1.60   0.235     1.62   0.124
Q2-2     0.40   0.238     0.39   0.204     0.01   0.185     0.01   0.131
Q3-1     0.14   0.214     0.14   0.187     0.13   0.191     0.13   0.141
Q3-2    -0.13   0.241    -0.14   0.211    -0.13   0.195    -0.25   0.164

¹MdAA and MdA represent QTL Model-AA and Model-A, respectively. ²Estimation of QTL additive effects (average of 300 replications). ³Standard deviation.

Table 3-21. Detection power of QTLs and probability of false QTL detected for Model-AA and Model-A by using the IM and CIM methods.

         IM Method                CIM Method
QTLs     Model-AA    Model-A      Model-AA    Model-A
Q1-1      88.67       91.67       100.00      100.00
Q1-2      17.33       16.33        30.67       60.00
Q1-3       1.00        3.00        68.67       98.00
Q2-1      99.67      100.00       100.00      100.00
Q2-2       4.67        5.33         0.33        0.33
Q3-1       0.67        0.00         1.00        2.33
Q3-2       0.67        1.00         1.00        6.00

FQTL¹     12.00       11.00        35.67        9.67

¹Probability of false QTL detected in the whole genome.

As shown in Table 3-22, the estimated position of Q1-1 has a certain bias, especially for the CIM method. This bias is caused by the linkage between Q1-1 and Q1-2. For the CIM method, the position estimates for the QTLs with medium effects (Q1-2 and Q1-3) are quite accurate in both Model-AA and Model-A. For the QTL with a large effect (Q2-1), the position estimate is essentially unbiased, especially for the simple additive Model-A. In general, the position estimates for the detected QTLs are quite accurate for both QTL models under both the IM and CIM methods.

The estimates of the additive effects for the detected QTLs (see also Table 3-22) are unbiased in every situation for the QTL with a large effect (Q2-1). For the QTLs with medium effects under the CIM method, Q1-1 is overestimated because of QTL linkage, and Q1-2 is overestimated in Model-AA but unbiased in Model-A. For the IM method, the effect estimates for the detected QTLs show a certain overestimation, especially for the QTLs with medium and small effects.


Table 3-22. Estimation of QTL positions and additive effects for the detected QTLs of Model-AA and Model-A.

        IM & MdAA        IM & MdA         CIM & MdAA       CIM & MdA
QTLs    Pos¹    Eff²     Pos     Eff      Pos     Eff      Pos     Eff
Q1-1     14.6    1.2      15.3    1.2      17.3    1.4      17.2    1.4
Q1-2     28.7    1.2      28.0    1.2      30.4    1.1      30.7    0.9
Q1-3     83.4   -0.9      77.4   -0.7      75.8   -0.8      75.5   -0.8
Q2-1      8.7    1.7       8.7    1.6       9.6    1.6       8.8    1.6
Q2-2     83.2    0.9      81.1    0.8      89.7   -0.6      89.7    0.5
Q3-1     52.0    0.7      ---     ---      51.9    0.3      46.7    0.6
Q3-2    102.0   -0.8     104.7   -0.7     100.3   -0.6      98.1   -0.6

¹QTL positions in cM. ²Average QTL additive effects for the detected QTLs.

3. Using MCIM Method

- QTL by Environment Interaction

By using the mixed linear model approach (2-13), the MCIM method is able to analyse the QTL mapping data of all environments together. As shown in Table 3-23, the simulation results indicate that the estimates of the QTL main effects and the predictions of the QTL by environment interaction effects are unbiased. In contrast, it is difficult to obtain unbiased estimates or predictions of the QE interaction effects with the IM or CIM method (Table 3-16).

Table 3-23. Estimation of the main effect and QE interaction effects at the QTL positions for Model-AE when using the MCIM method with the mixed linear model approach.

                      Main             E1               E2               E3
QTLs   Chr   Pos      Q¹      Est²     QE1     Est      QE2     Est      QE3     Est
Q1-1   1     22.0     0.00    0.08    -0.34   -0.35    -0.38   -0.36     0.72    0.71
Q2-1   2     54.7     0.62    0.64     0.28    0.29     0.39    0.36    -0.67   -0.65
Q3-1   3     96.4     0.62    0.61     0.00   -0.02     0.00    0.02     0.00    0.00

¹Real QTL effect (parameter). ²Estimation or prediction of the QTL effect (average of 300).

Table 3-24 shows the QTL detection power and the estimated positions and effects of the detected QTLs for Model-AE when the MCIM method is used. For Q1-1, the detection power is quite low, and the estimates of the position and the main effect of the detected QTL are somewhat biased. This could be caused by the zero main effect of this QTL (0.00). However, the estimates of the QE interaction effects for the detected QTL at Q1-1 are reasonably good. The MCIM method has high power to detect Q2-1 and Q3-1, and the estimates of the main effects and QE interaction effects of these detected QTLs are unbiased. The 95% ECI of the positions of these three QTLs spans about 20 cM.


Table 3-24. Estimation of powers, positions, and effects for the detected QTLs of Model-AE when using the MCIM method with the mixed linear model approach.

                Positions                    QE Interaction Effects
QTL    Power    Est     ECI¹      Main²      QE1³     QE2      QE3
Q1-1   12.5     25.6    12−34     0.17       -0.42    -0.38     0.80
Q2-1   96.4     55.0    44−64     0.64        0.31     0.38    -0.69
Q3-1   97.7     94.9    84−104    0.63       -0.03     0.01     0.02

¹The 95% ECI of the QTL positions. ²The QTL main effects. ³The QTL by environment 1 interaction effects.

- QTL Epistasis

Table 3-25. The estimation of additive and epistatic effects at the QTL positions for Model-AA by using the MCIM method when the mixed linear model is used (300 replications).

        Additive                Epistatic
QTLs    Eff¹     Est²           QTL i   QTL j    Eff      Est
Q1-1     0.80     1.02          Q1-1    Q2-2      0.48     0.51
Q1-2     0.80     0.92          Q1-2    Q1-3      0.96     0.98
Q1-3    -0.80    -0.82          Q2-1    Q2-2     -0.96    -0.97
Q2-1     1.61     1.63          Q2-2    Q3-1      1.92     1.89
Q2-2     0.00     0.03
Q3-1     0.27     0.24
Q3-2    -0.27    -0.25

¹Parameter settings for the QTL effects. ²Estimation of the QTL effects.

According to the simulation study (3-4-2), epistatic effects harm the efficiency and the results of QTL mapping when the IM or CIM method is used. In addition, there is no way to estimate QTL epistatic effects with these two methods. In contrast, by using the mixed linear model approach (2-14), the MCIM method can analyse QTL additive effects and QTL epistatic effects at the same time by fitting two intervals in the model. QTLs with additive and/or epistatic effects can be located through a two-dimensional search procedure (Wang et al. 1999).

Table 3-25 shows the estimates of the QTL additive effects and QTL epistatic effects for Model-AA obtained with the MCIM method. The simulation results indicate that the estimates of most additive and epistatic effects are unbiased. The linkage between Q1-1 and Q1-2 may cause the overestimation of the additive effects of these two QTLs.


Table 3-26. The power of QTL detection and the estimation of additive and epistatic effects at the QTL positions for Model-AA and Model-A by using the MCIM method when the mixed linear model is used (300 replications, threshold LOD = 2.5).

        Power              Additive              Epistatic
QTLs    MdAA¹    MdA²      MdAA     MdA          QTL i   QTL j    MdAA     MdA
Q1-1    85.0     91.0       1.02     1.13        Q1-1    Q2-2      0.51    -0.04
Q1-2    66.0     69.7       0.92     0.96        Q1-2    Q1-3      0.98     0.07
Q1-3    61.5     54.0      -0.82    -0.71        Q2-1    Q2-2     -0.97    -0.05
Q2-1    99.3     99.7       1.63     1.63        Q2-2    Q3-1      1.89     0.03
Q2-2    43.3     0.33       0.03     0.03
Q3-1    32.0     1.67       0.24     0.14
Q3-2     1.0     1.33      -0.25    -0.15

¹Model-AA. ²Model-A.

Table 3-26 shows the QTL detection power and the estimates of the QTL effects at the known QTL positions for Model-AA and Model-A when the MCIM method is used. Q2-2 and Q3-1 show a large improvement in detection power in Model-AA compared with Model-A, and the same is true for Q1-3. This result implies that QTL epistatic effects can improve the QTL detection power of the MCIM method. The estimates of the QTL epistatic effects for the simple additive Model-A are almost 0, which indicates that there is no harm in mapping a simple additive model with the extended epistatic model of the MCIM method.

Table 3-27. The power of QTL detection, the probability of false QTL detected, and the estimation of QTL positions for the detected QTLs for Model-AA and Model-A by using the MCIM method with the simple additive model (300 replications, threshold LOD = 2.5).

        Power                    QTL position (ECI)
QTLs    Model-AA    Model-A      Model-AA          Model-A
Q1-1     83.33       91.67       15.6 (6, 26)      15.8 (8, 26)
Q1-2     29.33       38.33       28.9 (19, 35)     28.8 (17, 35)
Q1-3     35.67       57.67       75.0 (64, 85)     75.2 (64, 83)
Q2-1    100.00      100.00        8.6 (6, 12)       8.5 (6, 12)
Q2-2      1.67        0.67       82.9 (73, 96)     86.4 (83, 90)
Q3-1      2.33        2.33       49.6 (40, 60)     47.2 (37, 53)
Q3-2      0.67        1.33       93.0 (92, 94)    105.0 (104, 106)

FQTL     11.00        7.67

Table 3-27 shows the simulation results for the QTL detection power, the probability of detecting false QTLs, and the estimated positions of the detected QTLs when the MCIM method is used with the simple additive model. The results imply that, with the simple additive model, the QTL detection power decreases and the probability of detecting false QTLs increases when epistatic effects exist. On the other hand, the QTL position estimates remain unbiased in this situation.


4. Model Selection and Criteria

4-1 MIM and Model Selection

The multiple interval mapping (MIM) method uses multiple marker intervals simultaneously to fit multiple putative QTLs in the model for QTL mapping. It is a multiple-QTL oriented method that combines QTL mapping analysis with the analysis of the genetic architecture of quantitative traits. Through a search algorithm, the method can obtain detailed information about the significant QTLs simultaneously, such as their number, positions, effects, and interactions.

The search strategy of the MIM method is to select the best (or a better) genetic model in the parameter space. In other words, it is a model selection problem. Model selection is therefore the key component of the analysis and the basis of genetic parameter estimation and data interpretation in any QTL mapping method that uses multiple intervals. Model selection in a high and unknown dimension is very complicated. Appropriate criteria or stopping rules for model selection are very important but very difficult to decide.

In this research, we study the properties of the criteria for model selection in the QTL mapping framework. Only the ideal case is considered here. First, the cross design is a backcross between pure-breeding parental lines, chosen for its simplicity. However, the results can be extended to other experimental designs with only two different marker genotypes, such as DH and RIL populations. For more complicated populations such as F2, the basic principles still hold. Second, all QTL effects are assumed to be the same and all markers are assumed to be equally spaced, for the sake of standardizing the criteria. Finally, all QTLs are positioned exactly on markers.

As an example, the parameter settings for the starting model (Model-S) are as follows: the sample size is 300; the genome has 3 chromosomes, each with 14 evenly distributed markers at a spacing of 8 cM; and there are 8 QTLs in total with the same effect, 4 on chromosome 1, 1 on chromosome 2, and 3 on chromosome 3. When the heritability is set to 0.8, 0.5, and 0.2, the QTL effects are 1.014, 0.507, and 0.169, respectively.

In this situation, we can simply perform model selection on the markers by least squares multiple regression. The results obtained under this assumption are still useful for two reasons. First, if the marker density is high, the distance between a QTL and its nearest marker is very small and can be ignored. Second, when the markers are sparse, MIM can use the maximum likelihood method (Kao and Zeng 1997) to estimate the QTL positions from the marker genotypes and positions. In terms of model selection, however, the principle should be the same. Our results are therefore still useful for MIM model selection in practice.

4-2 Model Evaluation Standard

Consider a multiple regression model for a backcross population:

$$y_i = \mu + \sum_{j=1}^{M} \beta_j X_{ij} + \varepsilon_i \tag{4-1}$$

where $y_i$ is the trait value of individual i, $\mu$ is the mean of the model, and M is the number of markers fitted in the model. $\beta_j$ is the partial regression coefficient (marker effect) for marker j, and $X_{ij}$ is the marker indicator variable for individual i and marker j. For the backcross population, $X_{ij}$ has two possible values, for example 1 for the MM and 0 for the Mm marker genotype. $\varepsilon_i$ is a random residual variable assumed to be normally distributed with mean 0 and variance $\sigma^2$.
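Model (4-1) can be turned directly into a data generator. The sketch below is only an illustration: the function name, the marker effects, the mean, and the tiny random marker matrix are hypothetical and are not the settings of Model-S.

```python
import random

def simulate_trait(X, beta, mu, sigma, rng):
    """Draw y_i = mu + sum_j beta_j * X_ij + e_i with e_i ~ N(0, sigma^2)."""
    return [mu + sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
            for row in X]

rng = random.Random(7)
# Hypothetical data: 5 individuals, 3 markers coded 1 (MM) or 0 (Mm)
X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(5)]
beta = [0.5, 0.0, -0.5]          # illustrative marker effects
y = simulate_trait(X, beta, mu=15.6, sigma=1.0, rng=rng)
```

Setting `sigma` to 0 returns the genotypic values themselves, which is a convenient check of the coding.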

The goal of model selection is to find a better (not necessarily the best) model with M markers through a search procedure. Ideally, in our situation, these markers are the QTLs. In doing this, we can make two kinds of error. The first type of error (called $\alpha$) is that some of the markers selected in the final model are not QTLs. This kind of error is related to the Type I error in some sense. The second type of error (called $\beta$) is that some QTLs (markers) are not included in the final model; this kind of error is related to the QTL detection power ($1-\beta$). It is very important to balance these two types of error in model selection practice.

As a model selection standard, we define three parameters, $\alpha$, $\beta$, and $\chi$, for measuring how well the selected model fits the real model. Let the real QTL number be N and the identified QTL number be n, with $N_c$ and $n_c$ the corresponding numbers on chromosome c. Let the real QTL position (cM) be P and the identified QTL position be p, with positions measured from the beginning of the chromosome.

$$\alpha = \frac{1}{N}\sum_{c}\left(n_c - N_c\right), \quad c = 1, 2, \ldots, C, \quad \text{if } n_c \ge N_c \tag{4-2}$$

$$\beta = \frac{1}{N}\sum_{c}\left(N_c - n_c\right), \quad c = 1, 2, \ldots, C, \quad \text{if } n_c \le N_c \tag{4-3}$$

$$\chi = \frac{\sum_{c}\sum_{t}\left|P_{ct} - p_{ct}\right|}{\sum_{c} R_c}, \quad c = 1, 2, \ldots, C, \quad t = 1, 2, \ldots, R_c, \quad R_c = \min(n_c, N_c) \tag{4-4}$$

where C is the number of chromosomes, $\alpha$ is the percentage of wrongly identified QTLs, $\beta$ is the percentage of missed QTLs, and $\chi$ is the average distance between the identified QTLs and the real QTLs.

For each chromosome, if the real QTL number is not equal to the identified QTL number, there are many ways to pair the real and identified QTL positions. The criterion used here is to choose the pairing that minimizes the total distance on each chromosome.
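The three measures can be computed as in the following sketch. It rests on assumptions: surplus and missing QTLs are counted per chromosome and divided by the total real QTL number N, the average distance is taken over all matched pairs, and the minimum-distance pairing is found by trying every order-preserving subset of the longer list. The function names and the example positions are hypothetical.

```python
from itertools import combinations

def matched_distance(real, found):
    """Minimum total distance (cM) when pairing min(N_c, n_c) positions on one chromosome."""
    real, found = sorted(real), sorted(found)
    small, large = (real, found) if len(real) <= len(found) else (found, real)
    if not small:
        return 0.0, 0
    # Try every order-preserving subset of the longer list and keep the closest one
    best = min(sum(abs(a - b) for a, b in zip(small, subset))
               for subset in combinations(large, len(small)))
    return best, len(small)

def selection_errors(real_by_chr, found_by_chr):
    """Return (alpha, beta, chi) for dicts mapping chromosome -> list of positions."""
    n_real = sum(len(v) for v in real_by_chr.values())
    extra = missed = pairs = 0
    total = 0.0
    for c, real in real_by_chr.items():
        found = found_by_chr.get(c, [])
        extra += max(len(found) - len(real), 0)    # wrongly identified QTLs
        missed += max(len(real) - len(found), 0)   # missed QTLs
        d, r = matched_distance(real, found)
        total += d
        pairs += r
    chi = total / pairs if pairs else float("nan")
    return extra / n_real, missed / n_real, chi

# Hypothetical example: 8 real QTLs on 3 chromosomes (positions are made up)
real = {1: [8, 24, 40, 56], 2: [48], 3: [16, 32, 64]}
found = {1: [8, 24, 40], 2: [48, 80], 3: [16, 40, 64]}
alpha, beta, chi = selection_errors(real, found)
```

In this example one QTL is missed on chromosome 1 and one surplus QTL is reported on chromosome 2, so both error rates are 1/8, and the only positional error is 8 cM on chromosome 3 among 7 matched pairs.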

4-3 Model Selection Strategy and Criteria

One of the difficulties of model selection is that there are too many potential models to consider. In our situation, if the total number of markers is M, there are about $2^M$ possible models. For example, if there are 100 markers in the whole genome, there are more than $1.2 \times 10^{30}$ possible models, and it is infeasible to test all of them to obtain the best model.
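The size of the model space is easy to verify. The sketch below counts the models for M = 100 in two ways: directly as the number of marker subsets, and by adding up the number of models with exactly p regressors for p = 0, ..., M.

```python
import math

M = 100
n_models = 2 ** M                      # every marker is either in or out of the model
# The same total is obtained by summing the numbers of models with p regressors
by_size = sum(math.comb(M, p) for p in range(M + 1))
print(f"{n_models:.3e}")               # about 1.268e+30
```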

We can divide the comparison of all possible models into two major groups: models with the same number of regressors and models with different numbers of regressors. If the whole genome has M markers, all possible models can be divided into M+1 classes, each containing the models with the same number of regressors (from the model with M regressors down to the one with no regressors, only the model mean). The criterion for selecting the best model among models with the same number of regressors is relatively simple: the best model is the one with the largest coefficient of determination ($R^2$) or the smallest residual sum of squares (RSS). From formulas (4-5) and (4-6), it is easy to see that $R^2$ can be considered a measure of the goodness of fit of the model, and that maximizing the value of $R^2$ is equivalent to minimizing the value of RSS.

$$R^2 = \frac{\sum_i \left(\hat{Y}_i - \bar{Y}\right)^2}{\sum_i \left(Y_i - \bar{Y}\right)^2} \tag{4-5}$$

$$RSS = \left(1 - R^2\right)\sum_i \left(Y_i - \bar{Y}\right)^2 \tag{4-6}$$

where $\hat{Y}_i$ is the fitted value for individual i and $\bar{Y}$ is the sample mean.

Comparing models with different numbers of regressors is the most difficult task in model selection. The reason is that, as the number of regressors increases, the value of $R^2$ never decreases and the value of RSS never increases. Therefore, it is impossible to decide which model is better by simply comparing the values of $R^2$ or RSS. One must decide what increase in $R^2$ is required before accepting an additional regressor or, thinking backward, what decrease in $R^2$ is acceptable before dropping a regressor.

selection practice. The first case is that to find the best model inside classes with the

same number of regressors. In this case, criterion itself is simple (use or RSS) but

the difficulty will be too many models to be considered. One way to solve this

problem is to use certain procedure to search through a limited space to find a better

(there no way to guarantee the best) model. These search methods include forward,

backward, stepwise, and branch-and-bound etc. The second case is to find the best

model amount the models with different regressors by using certain criteria. Our

research will deal with the first problem but the focus is on the second one, the criteria

or stopping rule for model selection.

2R


4-4 Procedure of Model Selection

The first step of model selection is to find the best (or at least a better) model for each class with the same number of regressors without an exhaustive search. For M markers, we can find the M+1 models

$$\eta_p, \quad p = 0, 1, \ldots, M$$

where p is the number of regressors in the model.

The forward stepwise selection (FW) or backward elimination (BW) method can be used for this purpose. The FW method chooses the subset models by adding one regressor at a time to the previously chosen model. It starts with the one-regressor model, selecting the regressor that contributes the largest sum of squares (SS) to the model. At each successive step, the regressor not yet in the model that causes the largest decrease in RSS (that is, has the largest partial sum of squares) is added to the model. This procedure can continue until all regressors are in the model. The BW method chooses the subset models by starting with the full model and then eliminating, at each step, the regressor whose deletion causes the smallest increase in RSS. This is the regressor in the current model that has the smallest partial sum of squares. This procedure can also continue until the model contains no regressors (only the model mean is left).
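The two procedures can be sketched as follows. This is a minimal illustration and not the program used in this study: the least-squares fit solves the normal equations by Gauss-Jordan elimination, the markers are simulated as unlinked (an assumption made only to keep the example short), and all names and parameter values are hypothetical.

```python
import random

def rss(X, y, cols):
    """RSS of y regressed on an intercept plus the marker columns in cols."""
    n, k = len(y), len(cols) + 1
    A = [[1.0] + [X[i][j] for j in cols] for i in range(n)]
    # Augmented normal equations [A'A | A'y]
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(k)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    b = [M[r][k] / M[r][r] for r in range(k)]
    return sum((y[i] - sum(bj * aj for bj, aj in zip(b, A[i]))) ** 2 for i in range(n))

def forward_path(X, y):
    """Models with 0, 1, ..., M regressors chosen by forward selection."""
    chosen, path = [], [((), rss(X, y, []))]
    while len(chosen) < len(X[0]):
        rest = [j for j in range(len(X[0])) if j not in chosen]
        best = min(rest, key=lambda j: rss(X, y, chosen + [j]))  # largest drop in RSS
        chosen.append(best)
        path.append((tuple(chosen), rss(X, y, chosen)))
    return path

def backward_path(X, y):
    """Models with M, M-1, ..., 0 regressors chosen by backward elimination."""
    chosen = list(range(len(X[0])))
    path = [(tuple(chosen), rss(X, y, chosen))]
    while chosen:
        # drop the regressor whose removal increases RSS the least
        drop = min(chosen, key=lambda j: rss(X, y, [c for c in chosen if c != j]))
        chosen.remove(drop)
        path.append((tuple(chosen), rss(X, y, chosen)))
    return path

rng = random.Random(3)
n, m = 200, 6
X = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]   # unlinked markers (assumption)
# Hypothetical QTLs located exactly on markers 0 and 3
y = [10.0 + 1.0 * row[0] + 1.0 * row[3] + rng.gauss(0.0, 0.5) for row in X]
fw, bw = forward_path(X, y), backward_path(X, y)
```

Each path holds one model per class, so `fw[p]` and `bw[m - p]` are the models with p regressors found by the two procedures.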

Compared with an exhaustive search, the FW and BW methods save a great amount of computation time. The cost is that, in the FW method, once a regressor is included it always stays in the subsequent models and, in the BW method, once a regressor is excluded it has no chance of entering the subsequent models again. Therefore, there is no guarantee that the model selected by the FW or BW method is the best model in each class with the same number of regressors. However, in our situation, because of the linear structure of marker positions and QTL locations, the model selected by the FW or BW method is expected to be the best model in each class. Zeng (1993) proved an important property of the partial regression coefficient in multiple regression analysis: if there is no crossing-over interference and no epistasis, the partial regression coefficient is expected to depend only on those QTLs that are located in the intervals bracketed by the two neighboring markers, as shown in formula (4-7).

$$b_{x_i} \approx \sum_{x_{i-1} < g_k < x_i} \frac{r_{x_{i-1} g_k}}{r_{x_{i-1} x_i}}\,\delta_k + \sum_{x_i < g_k < x_{i+1}} \frac{r_{g_k x_{i+1}}}{r_{x_i x_{i+1}}}\,\delta_k \tag{4-7}$$

In our ideal situation (Model-S), the partial regression coefficient is expected to equal the QTL effect for markers with a QTL and 0 for markers without a QTL. Even when the QTLs are not located exactly on markers, the partial regression coefficients of markers near the QTLs are expected to be large compared with those of markers far from the QTLs. Because the partial regression coefficient is directly related to the partial sum of squares, the markers with QTLs are expected to be selected first in the FW method, and the markers without QTLs are expected to be eliminated first in the BW method.

Table 4-1 shows the simulation results for the average $R^2$ of the true model and of the model selected by the BW method under Model-S. For each replicate sample, the $R^2$ of the true model is calculated by fitting the true parameters (the real QTL number and positions) in the multiple regression model. The $R^2$ of the selected model is calculated for the model with 8 regressors selected from the full model by the BW method. Therefore, the only difference between these two models is the marker (QTL) positions. The BW method clearly performs very well, and its average $R^2$ is even larger than that of the true model in some cases. Because of sampling variance, the true model is not necessarily the model with the maximum $R^2$, but it is usually a good one (it has a small standard deviation).

Table 4-1. Comparison of the coefficient of determination (R²) between the model selected by the backward procedure and the true model for Model-S. The sample size is 300 and the number of replications is 1000.

| Heritability | Backward selected: Low 2.5%¹ | Backward selected: Mean² | Backward selected: Up 2.5%³ | True model: Low 2.5% | True model: Mean | True model: Up 2.5% |
|---|---|---|---|---|---|---|
| 0.8 | 0.8029 | 0.8041 | 0.8054 | 0.8027 | 0.8043 | 0.8052 |
| 0.5 | 0.5196 | 0.5219 | 0.5242 | 0.5109 | 0.5132 | 0.5154 |
| 0.2 | 0.2524 | 0.2549 | 0.2574 | 0.2194 | 0.2218 | 0.2243 |

¹The value of R² at the lower 2.5% position. ²Average value of R². ³The value of R² at the upper 2.5% position.


4-5 Summary of Criteria for Model Selection

After a series of models with different numbers of regressors has been obtained by the BW or FW method, the more difficult question remains: deciding which model (that is, how many regressors) is the one we are looking for. This is the problem of the stopping rule, or model selection criterion. Some of the criteria are as follows:

1. Adjusted R²

$$\bar{R}_p^2 = 1 - (1 - R_p^2)\,\frac{n}{n - p}$$

where p is the number of regressors and n is the sample size. Because R² always increases as the number of regressors increases, the adjusted R² uses the number of regressors as a penalty to strike a balance. The model with the maximum adjusted R² is the final model under this criterion.
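As a small illustration of how the penalty works, the adjusted R² can be written as a short function. This is only a sketch: the function name `adjusted_r2` and the example numbers are illustrative assumptions, not values taken from the simulations.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 in the form 1 - (1 - R^2) * n / (n - p).

    r2 : ordinary coefficient of determination of the model
    n  : sample size
    p  : number of regressors in the model
    """
    if n <= p:
        raise ValueError("the sample size n must exceed the number of regressors p")
    return 1.0 - (1.0 - r2) * n / (n - p)


# Hypothetical example: the same raw R^2 is penalised more with more regressors.
small_model = adjusted_r2(0.50, n=300, p=8)
large_model = adjusted_r2(0.50, n=300, p=20)
print(small_model, large_model)
```

With an equal raw R², the model with fewer regressors receives the larger adjusted value, which is the balancing behaviour described above.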

2. Mallows' Cp (Mallows 1973)

$$C_p = \frac{RSS_p}{\hat{\sigma}^2} + 2p - n$$

where $RSS_p$ is the residual sum of squares for the model with p regressors and $\hat{\sigma}^2$ is the estimate of the residual variance $\sigma^2$. The model with the minimum value of $C_p$ is the final model under this criterion.
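The Cp statistic is simple to compute once the residual sum of squares is available. The sketch below assumes that the residual variance estimate is supplied by the caller (it is commonly taken from the full model); the function name `mallows_cp` and the numbers are hypothetical.

```python
def mallows_cp(rss_p, sigma2_hat, p, n):
    """Mallows' Cp = RSS_p / sigma2_hat + 2p - n.

    rss_p      : residual sum of squares of the model with p regressors
    sigma2_hat : estimate of the residual variance (assumed given)
    p          : number of regressors
    n          : sample size
    """
    if sigma2_hat <= 0:
        raise ValueError("the residual variance estimate must be positive")
    return rss_p / sigma2_hat + 2 * p - n


# Hypothetical example: a model whose RSS is close to (n - p) * sigma2_hat
# has a Cp close to p.
print(mallows_cp(rss_p=146.0, sigma2_hat=0.5, p=8, n=300))
```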

3. Mean Squared Error of Prediction (Aitkin 1974, Miller 1990)

The MSEP statistic is often called the PRESS statistic:

$$PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad i = 1, 2, \ldots, n$$

where $y_i$ is the trait value and $\hat{y}_i$ is the estimate of the trait value for individual i. The parameters used to estimate $\hat{y}_i$ are obtained by the least squares method using all samples except individual i. This method is usually called cross-validation, and it is possible to drop several individuals at a time instead of just one. The model with the minimum PRESS statistic is the final model under this criterion.
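The leave-one-out idea behind PRESS can be sketched for the simplest case of a single regressor. This is an illustrative sketch only: the helper names `fit_line` and `press` are assumptions, and a real analysis would use a multiple regression on several markers.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the remaining x values are not all equal
    return mean_y - slope * mean_x, slope


def press(xs, ys):
    """Sum of squared prediction errors, each from a fit that omits individual i."""
    total = 0.0
    for i in range(len(xs)):
        x_train = xs[:i] + xs[i + 1:]
        y_train = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        total += (ys[i] - (intercept + slope * xs[i])) ** 2
    return total


# Exactly linear data are predicted without error, so PRESS is zero.
print(press([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

Dropping several individuals at a time only changes how the training and held-out sets are formed in the loop.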

4. BIC and Related Criteria

$$B(p) = n \ln(\hat{\sigma}_p^2) + p\,g(n) \tag{4-8}$$

where p is the number of regressors in the model, n is the sample size, $\hat{\sigma}_p^2$ is the least squares or maximum likelihood estimate of the residual variance for the model with p regressors, and $g(n)$ is a function of the sample size n.

Selecting $g(n) = 2$ gives the AIC criterion (Akaike 1970, 1973), and $g(n) = \ln(n)$ gives the BIC criterion defined by Schwarz (1978) and Rissanen (1978). Other functions $g(n)$ can be used (Debasis and Murali, 1996) as long as they satisfy the following properties:

$$\lim_{n \to \infty} \frac{g(n)}{n} = 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{g(n)}{\ln \ln n} = \infty$$
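To make the role of g(n) concrete, the sketch below evaluates B(p) for a series of nested models and returns the number of regressors with the smallest value. It assumes the maximum likelihood estimate RSS/n for the residual variance; the RSS values and the names `criterion` and `best_p` are hypothetical and serve only as an illustration.

```python
import math


def criterion(rss, n, p, g):
    """B(p) = n * ln(sigma2_hat) + p * g, with sigma2_hat = RSS / n (ML estimate)."""
    return n * math.log(rss / n) + p * g


def best_p(rss_by_p, n, g):
    """Number of regressors whose model minimises B(p)."""
    return min(rss_by_p, key=lambda p: criterion(rss_by_p[p], n, p, g))


# Hypothetical residual sums of squares for models with 0 to 4 regressors.
rss_by_p = {0: 300.0, 1: 200.0, 2: 150.0, 3: 148.0, 4: 147.5}
n = 300

print("AIC choice:", best_p(rss_by_p, n, 2.0))
print("BIC choice:", best_p(rss_by_p, n, math.log(n)))
```

In this made-up series the third regressor reduces the RSS only slightly, so the looser AIC penalty keeps it while the stricter BIC penalty drops it.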

Table 4-2. Relationship between sample size and the g(n) functions.

| Samples | [ln(n)]^0.1 | n^0.1 | AIC | [n ln(n)]^0.1 | [ln(n)]^0.5 | [ln(n)]^0.9 | BIC | n^0.5 | [n ln(n)]^0.5 |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 1.16 | 1.58 | 2.00 | 1.85 | 2.15 | 3.95 | 4.61 | 10.00 | 21.46 |
| 150 | 1.17 | 1.65 | 2.00 | 1.94 | 2.24 | 4.26 | 5.01 | 12.25 | 27.42 |
| 200 | 1.18 | 1.70 | 2.00 | 2.01 | 2.30 | 4.48 | 5.30 | 14.14 | 32.55 |
| 250 | 1.19 | 1.74 | 2.00 | 2.06 | 2.35 | 4.65 | 5.52 | 15.81 | 37.15 |
| 300 | 1.19 | 1.77 | 2.00 | 2.11 | 2.39 | 4.79 | 5.70 | 17.32 | 41.37 |
| 350 | 1.19 | 1.80 | 2.00 | 2.14 | 2.42 | 4.91 | 5.86 | 18.71 | 45.28 |
| 400 | 1.20 | 1.82 | 2.00 | 2.18 | 2.45 | 5.01 | 5.99 | 20.00 | 48.95 |
| 450 | 1.20 | 1.84 | 2.00 | 2.21 | 2.47 | 5.10 | 6.11 | 21.21 | 52.43 |
| 500 | 1.20 | 1.86 | 2.00 | 2.23 | 2.49 | 5.18 | 6.21 | 22.36 | 55.74 |
| 1000 | 1.21 | 2.00 | 2.00 | 2.42 | 2.63 | 5.69 | 6.91 | 31.62 | 83.11 |

Table 4-2 lists possible $g(n)$ functions and their values for various sample sizes. Formula (4-8) shows that the second term, $p\,g(n)$, is a positive penalty whose value decreases as the number of regressors decreases. This term balances the first term, $n \ln(\hat{\sigma}_p^2)$, which tends to increase as the number of regressors decreases. Therefore, a large value of $g(n)$ selects a model with fewer regressors. On the other hand, if the value of $g(n)$ is small, the criterion tends to select a model with more regressors.
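The entries of Table 4-2 follow directly from the column definitions and can be recomputed with a few lines. The sketch below is illustrative; the dictionary name `G_FUNCTIONS` and the function `g_values` are assumptions introduced here.

```python
import math

# Penalty functions g(n), keyed by the column labels of Table 4-2.
G_FUNCTIONS = {
    "[ln(n)]^0.1": lambda n: math.log(n) ** 0.1,
    "n^0.1": lambda n: n ** 0.1,
    "AIC": lambda n: 2.0,
    "[n ln(n)]^0.1": lambda n: (n * math.log(n)) ** 0.1,
    "[ln(n)]^0.5": lambda n: math.log(n) ** 0.5,
    "[ln(n)]^0.9": lambda n: math.log(n) ** 0.9,
    "BIC": lambda n: math.log(n),
    "n^0.5": lambda n: n ** 0.5,
    "[n ln(n)]^0.5": lambda n: (n * math.log(n)) ** 0.5,
}


def g_values(n):
    """Value of every g(n) function at sample size n, rounded to two decimals."""
    return {name: round(g(n), 2) for name, g in G_FUNCTIONS.items()}


for sample_size in (100, 200, 300):
    print(sample_size, g_values(sample_size))
```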

[Figure 4-1: six bar-chart panels titled "BIC (log n), h2 = 0.8, Backward", "BIC (log n), h2 = 0.8, Forward", "BIC (log n), h2 = 0.5, Backward", "BIC (log n), h2 = 0.5, Forward", "BIC (log n), h2 = 0.2, Backward", and "BIC (log n), h2 = 0.2, Forward". Horizontal axis ticks run from 1 to 40; vertical axis ticks run from 0 to 100.]

Figure 4-1. Comparison of FW and BW methods for model selection for the Model-S. The bars indicate the frequency of selected markers in 1000 replications.

4-6 Simulation Studies of Criteria

The purpose of this simulation work is to study the effects of using different criteria for model selection. The properties of model selection examined include the number of QTLs selected, the probability of wrongly identified QTLs, and the power and accuracy of QTL detection. We also try to obtain an experimental criterion for model selection that can be used specifically for QTL mapping practice in populations derived from inbred line crosses.

1. FW and BW Methods

Figure 4-1 shows the results of model selection for Model-S at different heritabilities (0.8, 0.5, and 0.2). Both the FW and BW methods were used for model searching, and the BIC criterion (see Formula (4-8) with $g(n) = \ln n$) was used as the stopping rule for model selection. The vertical bars represent the frequency with which markers were selected in the 1000 replications.

Both the FW and BW methods perform quite well here. Comparing the two, the BW method performs better in this case, especially when the heritability is high. It is interesting to see that the error rate of wrongly identified markers is usually lower for the BW method than for the FW method, particularly on the chromosomes with multiple QTLs (chromosome 1 and chromosome 3). However, the FW method performs very well on the chromosome with a single QTL (chromosome 2). The BW method will be used as the main searching method for model selection in the rest of the simulation studies.
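The BW search combined with a stopping rule of the form of Formula (4-8) can be sketched as follows. This is a simplified illustration under stated assumptions, not the program used for the simulations: the markers are independent ±1 columns rather than linked markers on a genetic map, the data are randomly generated, and the helper names `solve`, `rss`, and `backward_select` are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    design = [[1.0] * len(y)] + columns
    k = len(design)
    xtx = [[sum(u * v for u, v in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(design[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * design[i][t] for i in range(k)) for t in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def backward_select(markers, y, g):
    """Backward elimination; return the marker subset minimising n*ln(RSS/n) + p*g."""
    n = len(y)
    current = sorted(markers)

    def score(names):
        return n * math.log(rss([markers[name] for name in names], y) / n) + len(names) * g

    best_names, best_score = list(current), score(current)
    while current:
        # Drop the marker whose removal increases the RSS the least.
        drop = min(current,
                   key=lambda d: rss([markers[name] for name in current if name != d], y))
        current.remove(drop)
        s = score(current)
        if s < best_score:
            best_names, best_score = list(current), s
    return best_names


# Hypothetical data: 6 independent markers, of which m1 and m3 carry an effect.
rng = random.Random(1)
n = 200
markers = {f"m{j}": [rng.choice([-1.0, 1.0]) for _ in range(n)] for j in range(1, 7)}
y = [2.0 * markers["m1"][t] - 1.5 * markers["m3"][t] + rng.gauss(0.0, 1.0)
     for t in range(n)]
selected = backward_select(markers, y, math.log(n))  # g(n) = ln(n), i.e. BIC
print(selected)
```

The search records the criterion for every model size visited and keeps the best one, which corresponds to obtaining the series of BW models first and applying the stopping rule afterwards.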

2. Criteria and the Various Parameters

Table 4-3 shows the estimates of various parameters for Model-S under various model selection criteria. The parameters Alfa, Beta, and Distance are defined by Formulas (4-2), (4-3), and (4-4) respectively, and Mean is the average number of markers included in the final model (the real QTL number is 8). The heritability clearly has a great influence on these parameters. For example, under the BIC criterion, Mean is 8.86 when h² = 0.8 and falls to 4.76 when h² = 0.2, while the average Distance between the real QTL positions and the identified positions increases from 0.26 cM to 7.41 cM.

The parameters Alfa and Beta are very important for evaluating the fit between the selected model and the real model. Alfa is related to the probability of wrongly identified QTLs, and Beta is related to the QTL detection power (1.0 − Beta). Loose criteria such as AIC and $[\ln(n)]^{0.5}$ tend to pick a model with more regressors, so the Alfa value will be high and the Beta value will be low (the power is high). However, if strict criteria such as BIC and $n^{0.5}$ are applied, the opposite occurs. The primary goal in finding a good criterion is to balance the Alfa and Beta values.

Table 4-3. Estimation of various parameters for the Model-S (1000 replications).

| Parameters | Heritability | AIC | B^0.5 ² | B^0.9 ³ | BIC | n^0.5 |
|---|---|---|---|---|---|---|
| Mean¹ | 0.8 | 15.34 | 13.84 | 9.40 | 8.86 | 7.90 |
| | 0.5 | 14.23 | 12.82 | 8.45 | 7.69 | 4.24 |
| | 0.2 | 12.17 | 10.64 | 5.66 | 4.76 | 1.85 |
| Alfa | 0.8 | 0.92 | 0.73 | 0.18 | 0.11 | 0.00 |
| | 0.5 | 0.78 | 0.61 | 0.14 | 0.08 | 0.00 |
| | 0.2 | 0.60 | 0.45 | 0.08 | 0.04 | 0.00 |
| Beta | 0.8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| | 0.5 | 0.01 | 0.01 | 0.08 | 0.12 | 0.47 |
| | 0.2 | 0.08 | 0.12 | 0.37 | 0.44 | 0.77 |
| Distance (cM) | 0.8 | 0.13 | 0.15 | 0.25 | 0.26 | 0.29 |
| | 0.5 | 2.62 | 2.92 | 3.37 | 3.42 | 2.65 |
| | 0.2 | 7.94 | 8.47 | 8.14 | 7.41 | 4.78 |

¹Average number of markers selected in the final model. ²$g(n) = [\ln(n)]^{0.5}$. ³$g(n) = [\ln(n)]^{0.9}$.

Tables 4-4 to 4-6 give the parameter estimates under different models, using the BW searching method and different model selection criteria such as AIC and BIC. The purpose is to demonstrate the impact on the parameter estimates of changing factors such as chromosome number, marker density, and sample size. The model in Table 4-4 is similar to Model-S except that 2 extra chromosomes have been added, with no QTLs on the added chromosomes. The results show that the more chromosomes there are in the genome, the stricter the criterion needs to be: under the same criterion, the Alfa value increases as the chromosome number increases. The same conclusion holds for the model with denser markers (a smaller average distance between markers), shown in Table 4-5. Compared with Model-S, the dense model halves the average marker distance (to 4 cM). However, for the model with a large sample size (500), shown in Table 4-6, a looser criterion should be applied.


Table 4-4. Estimation of various parameters for the model with 5 chromosomes.

| Parameters | Heritability | AIC | B^0.5 | B^0.9 | BIC | n^0.5 |
|---|---|---|---|---|---|---|
| Mean | 0.8 | 23.81 | 20.81 | 11.30 | 9.88 | 7.92 |
| | 0.5 | 22.68 | 19.98 | 10.52 | 8.85 | 4.22 |
| | 0.2 | 20.64 | 17.70 | 7.45 | 5.90 | 1.79 |
| Alfa | 0.8 | 1.98 | 1.60 | 0.41 | 0.23 | 0.00 |
| | 0.5 | 1.84 | 1.51 | 0.40 | 0.23 | 0.00 |
| | 0.2 | 1.64 | 1.31 | 0.30 | 0.18 | 0.00 |
| Beta | 0.8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| | 0.5 | 0.00 | 0.01 | 0.08 | 0.12 | 0.47 |
| | 0.2 | 0.06 | 0.10 | 0.37 | 0.45 | 0.78 |
| Distance (cM) | 0.8 | 0.18 | 0.20 | 0.31 | 0.34 | 0.37 |
| | 0.5 | 2.80 | 3.07 | 3.72 | 3.70 | 2.71 |
| | 0.2 | 8.02 | 8.66 | 8.58 | 7.84 | 4.63 |

Table 4-5. Estimation of various parameters for the model with a marker density of 4 cM.

| Parameters | Heritability | AIC | B^0.5 | B^0.9 | BIC | n^0.5 |
|---|---|---|---|---|---|---|
| Mean | 0.8 | 27.47 | 23.89 | 12.24 | 10.56 | 7.88 |
| | 0.5 | 25.76 | 22.35 | 10.99 | 9.22 | 4.29 |
| | 0.2 | 24.41 | 20.72 | 8.34 | 6.36 | 1.88 |
| Alfa | 0.8 | 2.43 | 1.99 | 0.53 | 0.32 | 0.00 |
| | 0.5 | 2.22 | 1.79 | 0.43 | 0.25 | 0.00 |
| | 0.2 | 2.06 | 1.60 | 0.28 | 0.14 | 0.00 |
| Beta | 0.8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 |
| | 0.5 | 0.00 | 0.00 | 0.05 | 0.10 | 0.46 |
| | 0.2 | 0.01 | 0.01 | 0.24 | 0.35 | 0.77 |
| Distance (cM) | 0.8 | 0.55 | 0.61 | 0.81 | 0.85 | 0.90 |
| | 0.5 | 2.61 | 3.05 | 4.82 | 4.87 | 3.46 |
| | 0.2 | 5.37 | 6.68 | 10.62 | 10.05 | 5.39 |


Table 4-6. Estimation of various parameters for the model with a sample size of 500.

| Parameters | Heritability | AIC | B^0.5 | B^0.9 | BIC | n^0.5 |
|---|---|---|---|---|---|---|
| Mean | 0.8 | 14.97 | 13.22 | 9.05 | 8.60 | 8.00 |
| | 0.5 | 14.41 | 12.7 | 8.71 | 8.24 | 4.87 |
| | 0.2 | 12.49 | 10.75 | 6.03 | 5.29 | 2.14 |
| Alfa | 0.8 | 0.87 | 0.65 | 0.13 | 0.07 | 0.00 |
| | 0.5 | 0.80 | 0.59 | 0.11 | 0.06 | 0.00 |
| | 0.2 | 0.61 | 0.43 | 0.06 | 0.03 | 0.00 |
| Beta | 0.8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | 0.5 | 0.00 | 0.00 | 0.02 | 0.03 | 0.39 |
| | 0.2 | 0.05 | 0.09 | 0.31 | 0.37 | 0.73 |
| Distance (cM) | 0.8 | 0.01 | 0.02 | 0.04 | 0.04 | 0.05 |
| | 0.5 | 1.20 | 1.36 | 1.66 | 1.70 | 1.33 |
| | 0.2 | 6.06 | 6.48 | 6.42 | 6.02 | 4.20 |

3. Experimental Criteria

From the above simulations, it is clear that the criterion is very important for model selection in practice. Loose criteria increase the power of QTL detection but, at the same time, increase the probability of falsely detected QTLs. On the other hand, strict criteria can control the probability of falsely detected QTLs, but the detection power suffers. The ideal criterion should control the overall rate of false QTLs at a reasonably low probability. One way to do this is to hold the value of Alfa at 0.05, because Alfa is the overall rate of false QTL detection for the whole genome (see Formula (4-2)). With this kind of criterion, the detection power will be reasonably high and the average distance between the selected markers and the real QTLs will be optimised.

The question now is how to find this ideal criterion. It is not easy, because the criterion is affected by various factors such as heritability, chromosome number, marker density, and sample size. Because a statistical derivation of the formula for the ideal criterion is complicated and difficult, we instead conduct large-scale simulations and try to find an experimental criterion of model selection for inbred line QTL mapping practice.

From the above study, we believe that BIC is a good criterion of model selection in our situation, but it is not very accurate. We suggest the following modification of BIC in order to include factors such as heritability, chromosome number, marker density, and sample size.


$$BM(p) = n \ln(\hat{\sigma}_p^2) + p \ln(n) \times M \tag{4-9}$$

where M is a non-negative modifier of the BIC criterion, and a large M value means a strict criterion. The M value is constructed from various factors: heritability (H), marker density (D), sample size (S), and chromosome number (C). All of these factors are known except the heritability. However, it is easy to obtain an estimate of the heritability for a data set by regressing the trait value on all markers (i.e., using R²).
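The effect of the modifier M in Formula (4-9) can be illustrated with a short sketch. It assumes the maximum likelihood estimate RSS/n for the residual variance; the RSS values and the names `modified_bic` and `select_p` are hypothetical.

```python
import math


def modified_bic(rss, n, p, m):
    """BM(p) = n * ln(RSS / n) + p * ln(n) * M; M = 1 gives the ordinary BIC."""
    return n * math.log(rss / n) + p * math.log(n) * m


def select_p(rss_series, n, m):
    """Number of regressors that minimises BM(p) over a series of nested models."""
    return min(rss_series, key=lambda p: modified_bic(rss_series[p], n, p, m))


# Hypothetical residual sums of squares for models with 0 to 4 regressors.
rss_series = {0: 300.0, 1: 200.0, 2: 150.0, 3: 148.0, 4: 147.5}

for m in (0.6, 1.0, 1.5):
    print("M =", m, "selects p =", select_p(rss_series, 300, m))
```

A value of M below 1 loosens the criterion and a value above 1 tightens it, so the number of selected regressors never increases as M grows.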

Table 4-7 is an example of how to obtain the M value for a given combination of factors. In this example, the heritability is 0.5, the marker density (average marker distance) is 8 cM, the sample size is 300, the chromosome number is 3, and the M value for Alfa = 0.05 is about 1.15 ((1.1 + 1.2)/2). The values of M under different factor combinations are summarized in Table 4-8 and Figure 4-2.

Table 4-7. Parameter estimation for Model-S with h² = 0.5 using modified BIC criteria.

| Criteria | B0.6¹ | B0.7 | B0.8 | B0.9 | BIC | B1.1 | B1.2 | B1.3 | B1.4 | B1.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 10.29 | 9.42 | 8.71 | 8.20 | 7.71 | 7.32 | 6.97 | 6.67 | 6.43 | 6.20 |
| Distance | 3.39 | 3.46 | 3.43 | 3.48 | 3.43 | 3.40 | 3.34 | 3.32 | 3.25 | 3.15 |
| Alfa | 0.32 | 0.23 | 0.16 | 0.12 | 0.08 | 0.06 | 0.04 | 0.03 | 0.02 | 0.01 |
| Beta | 0.03 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.19 | 0.22 | 0.24 |

¹B0.6 means M = 0.6.
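The reading of Table 4-7 that gives M ≈ 1.15 can be reproduced by linear interpolation between the two neighbouring columns whose Alfa values bracket 0.05. The sketch below uses the Alfa row of Table 4-7 (with BIC taken as M = 1.0); the function name `m_for_target_alfa` is an assumption introduced here.

```python
# M values and the corresponding Alfa estimates from Table 4-7 (BIC is M = 1.0).
M_VALUES = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
ALFA = [0.32, 0.23, 0.16, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01]


def m_for_target_alfa(m_values, alfa, target=0.05):
    """Interpolate linearly to the M at which the decreasing Alfa series reaches target."""
    for (m_lo, a_lo), (m_hi, a_hi) in zip(zip(m_values, alfa),
                                          zip(m_values[1:], alfa[1:])):
        if a_lo >= target >= a_hi and a_lo != a_hi:
            return m_lo + (a_lo - target) / (a_lo - a_hi) * (m_hi - m_lo)
    raise ValueError("the target Alfa is not bracketed by the tabulated values")


print(round(m_for_target_alfa(M_VALUES, ALFA), 2))
```

Because 0.05 lies exactly halfway between 0.06 and 0.04, the interpolation coincides with the midpoint (1.1 + 1.2)/2 used in the text.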

Using multiple regression, it is easy to find the relationship between M and the various factors from the data in Table 4-8. We propose the following multiple regression model:

$$M = \mu + \beta_1 H + \beta_2 D + \beta_3 S + \beta_4 C + \varepsilon \tag{4-10}$$

where H is heritability, D is marker density, S is sample size, and C is chromosome number. The parameters ($\beta$s) in Formula (4-10) can be estimated by multiple regression analysis (SAS v6.12) from the data in Table 4-8. The experimental formula for M is:

$$\hat{M} = 1.5 + 0.5H - 0.06D - 0.02S + 0.1C \tag{4-11}$$


Table 4-8. The value of M under various parameter settings for α = 0.05.

| h² | Start¹ | D-16² | D-4 | D-2 | S-150³ | S-500 | S-1000 | C-5⁴ | C-9 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.90 | 0.60 | 1.20 | 1.40 | 1.10 | 0.80 | 0.80 | 1.35 | 1.70 |
| 0.2 | 0.97 | 0.70 | 1.25 | 1.55 | 1.15 | 0.87 | 0.87 | 1.40 | 1.70 |
| 0.3 | 1.00 | 0.75 | 1.35 | 1.60 | 1.20 | 0.95 | 0.90 | 1.40 | 1.75 |
| 0.4 | 1.10 | 0.77 | 1.45 | 1.60 | 1.25 | 1.00 | 0.95 | 1.45 | 1.75 |
| 0.5 | 1.15 | 0.80 | 1.50 | 1.80 | 1.35 | 1.05 | 0.95 | 1.50 | 1.80 |
| 0.6 | 1.20 | 0.95 | 1.60 | 1.80 | 1.45 | 1.10 | 1.00 | 1.50 | 1.80 |
| 0.7 | 1.20 | 0.90 | 1.60 | 1.90 | 1.50 | 1.10 | 1.00 | 1.45 | 1.80 |
| 0.8 | 1.25 | 0.95 | 1.60 | 1.95 | 1.50 | 1.10 | 1.00 | 1.45 | 1.80 |
| 0.9 | 1.25 | 0.95 | 1.60 | 1.90 | 1.50 | 1.10 | 1.00 | 1.45 | 1.75 |

¹Model-S with marker density = 8 cM, samples = 300 and chromosomes = 3. ²Marker density = 16 cM. ³Sample size = 150. ⁴Chromosome number = 5.

[Figure 4-2: three line-chart panels plotting M (vertical axis ticks from 0.70 to 2.20) against heritability (horizontal axis ticks from h0.1 to h0.9). Legends recovered from the panels: D-16, D-8, D-4, D-2 and S-150, S-300, S-500, S-1000.]

Figure 4-2. The value of M under various parameter settings for α = 0.05. From top to bottom: marker density, samples, and chromosomes; the solid line is Model-S.

Table 4-9 is an example of using this experimental criterion in two simulation cases. In the first case, we placed 9 QTLs on 3 chromosomes; the numbers (positions) of the QTLs are 4 (16, 40, 72, 104) for the first chromosome, 2 (8, 32) for the second chromosome, and 3 (8, 40, 80) for the last chromosome. The heritability is 0.6, the marker density is 4 cM, and the sample size is 200. In the second case, the heritability is 0.82, the marker density is 10 cM, and the sample size is 185. There are 7 QTLs distributed along 5 chromosomes, and the numbers (positions) are 2 (30, 70), 3 (10, 40, 80), 0, 0, and 2 (20, 50). The experimental criterion (Formulas (4-9) and (4-11)) works well in these two cases, controlling the average Alfa value at the 0.05 level. The QTL detection power is considerably high. However, the sampling variances are quite high, especially for the parameter Alfa. This may be caused by the small sample size; increasing the sample size will reduce the sampling variance of Alfa (data not shown).

Table 4-9. Estimation of parameters in model selection by using the experimental criterion.

| Cases | M¹ | QTLs | n² | SD³ | α | SD | β | SD | χ | SD |
|---|---|---|---|---|---|---|---|---|---|---|
| First | 1.66 | 9 | 6.9 | 1.7 | 0.048 | 0.076 | 0.28 | 0.12 | 6.1 | 4.2 |
| Second | 1.58 | 7 | 7.4 | 0.9 | 0.058 | 0.095 | 0.00 | 0.00 | 0.3 | 0.7 |

¹See Formula (13). ²Average number of identified QTLs. ³The standard deviation.

5. Conclusions and Discussion

5.1 Conclusion

One of the goals of this dissertation is to explore and compare the major QTL mapping methods through simulation studies. Single marker analysis is the simplest method for QTL mapping. The simulation study shows that, with the single marker method, the markers near QTLs have the highest significance level. However, nearby markers can also have a very high significance level when the QTL effect is large or when there is more than one QTL on a chromosome. Therefore, it is a method of QTL “detection”, not location.

Unlike single marker analysis, the IM method uses two markers to construct a testing interval to search for QTLs, and it can use the estimated genetic map to locate a QTL and estimate its effect at the same time. The CIM method is an extension of the IM method that includes extra markers in the model to control the background genetic variation. The MCIM method is similar to the CIM method except that it treats the markers used for background control as having random effects and uses a mixed linear model approach.

Under the simple additive QTL model and a DH population, our simulation results indicate that the IM method performs quite poorly in both the QTL detection power and the probability of false QTL detection. The performance of the MCIM method is reasonably good, while the CIM method has the lowest probability of false QTL detection and the highest power of QTL detection in most cases. However, unlike for the CIM method, changing the number of markers used for background control has little impact on the QTL mapping results of the MCIM method. The simulation studies also suggest that the CIM method gains a large improvement from increasing the sample size, while the MCIM method benefits from increasing the marker density.

It cannot be denied that assuming QTLs act additively is a great simplification of reality. Many studies supply strong evidence for QTL by environment (QE) interaction and for epistatic interactions between QTLs (Long et al. 1995). For the IM and CIM methods, the simulation studies imply that unbiased estimates of QTL main effects can be obtained by using the data from all environments together. However, it is difficult to estimate the QE interaction effects, even by mapping QTLs separately on the data from each environment. In this case, some of the QTLs may not be located and will be missed. The MCIM method can put all QTL main effects and QE interaction effects into the mixed linear model and obtain unbiased estimates of the main and QE effects, as indicated by the simulation study.

The MCIM method can use the mixed linear model to map QTLs with marginal and epistatic effects. The simulation study indicates that the MCIM method can obtain unbiased estimates of QTL marginal and epistatic effects. Although IM and CIM can give unbiased estimates of marginal QTL effects when QTL epistatic effects exist, the variance of the marginal effect estimates increases greatly. In addition, the detection power of QTLs goes down and the probability of false QTL detection goes up noticeably, especially for the CIM method, as the simulation study indicates.


MIM is a multiple-interval method for QTL mapping. The MIM method can obtain QTL information, including the number, positions, effects, and interactions of significant QTLs, simultaneously, and it can be used to analyse the genetic architecture of an experimental species. Two crucial tasks, the evaluation procedure and the search strategy, need to be accomplished before the MIM method is workable. Kao and Zeng (1999) completed the first task, the algorithm for analysing the likelihood of the data given a genetic model (QTL number, positions, and epistasis of QTLs). However, the second task, searching for and selecting a genetic model in the parameter space, is very complicated, and the problem of criteria for model selection is not yet completely solved.

Another goal of this dissertation is to check the performance of various model selection criteria within the QTL mapping framework through simulation studies. We have also defined a set of parameters describing the degree of fit between the selected model and the true model, and proposed an experimental criterion of model selection that can be used for QTL mapping work such as MIM. The experimental criterion is a modification of BIC that incorporates relevant factors such as heritability, marker density, sample size, and chromosome number. The experimental criterion can control the type I and type II errors at a reasonable level and is more precise than the well-known BIC criterion in the QTL mapping situation. It works well in our simulation cases.

There are a number of very important issues that have been neglected in this thesis.

We conclude our discussion with brief statements on several of these issues.

5.2 Threshold and Criteria

Unless the QTLs are very close together, the estimates of QTL effects at known QTL positions are unbiased for the single-interval methods such as IM, CIM, and MCIM, as indicated by the simulation studies. However, in real QTL mapping practice the QTL positions are unknown. Therefore, the priority in QTL mapping work is to find evidence for a QTL according to the LR value. The threshold is a predefined value: if the LR value exceeds the threshold, a QTL is declared and its position and effect can be estimated easily. The threshold value is clearly important, because a high threshold decreases the detection power for QTLs and lowers the probability of false QTL detection at the same time, whereas a low threshold does the opposite. Therefore, a high threshold is needed if the purpose of the QTL mapping experiment is to find the precise position of a QTL for cloning, and a low threshold is appropriate if the purpose of the experiment is to find as many QTLs as possible.

The appropriate threshold value for the IM method can be obtained theoretically because the IM method is a simple one-QTL model. For a DH or backcross population, the distribution of the maximum LR over a whole marker interval lies between $\chi^2_1$ and $\chi^2_2$, closer to $\chi^2_1$ for relatively small intervals (< 10 cM), and the LOD threshold is between 2 and 3 for many organisms. However, the threshold can be affected by many experimental factors such as sample size, genome size, marker density, missing data, and segregation distortion. Therefore, for a particular experiment, the theoretical threshold is sometimes not accurate. One way to obtain a threshold value specific to the experimental data set is to use a permutation test. This approach to estimating a significance threshold is based on the simple observation of the marker-phenotype association.
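The permutation idea can be sketched for the simplest case of single-marker tests. This is an illustrative sketch under stated assumptions, not the procedure of any particular program: the test statistic is the single-marker regression LR, n ln(RSS0/RSS1), the data are randomly generated, and the names `lr_single_marker` and `permutation_threshold` are hypothetical.

```python
import math
import random


def lr_single_marker(x, y):
    """LR statistic n * ln(RSS0 / RSS1) for regressing y on one marker column x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no information
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -n * math.log(1.0 - r2)  # RSS0 / RSS1 = 1 / (1 - r^2)


def permutation_threshold(markers, y, n_perm=200, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the genome-wide maximum LR under permutation."""
    rng = random.Random(seed)
    shuffled = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the marker-trait association
        maxima.append(max(lr_single_marker(x, shuffled) for x in markers))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return maxima[index]


# Hypothetical data: 10 independent markers and a trait with no QTL.
rng = random.Random(2)
markers = [[rng.choice([-1.0, 1.0]) for _ in range(100)] for _ in range(10)]
trait = [rng.gauss(0.0, 1.0) for _ in range(100)]
print(permutation_threshold(markers, trait, alpha=0.05))
```

Taking the maximum over all markers within each permutation is what makes the resulting threshold specific to the genome size and marker density of the data set at hand.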

Unlike the simple situation in the IM method, which only tests the hypothesis of no QTL against that of one QTL, the CIM and MCIM methods have to consider the possible existence of several QTLs on one chromosome. The theoretical derivation of the threshold is complicated by the multiple-testing problem. On the other hand, the theoretical basis of the permutation test for the CIM or MCIM method is questionable because of the difficulty of permuting the markers selected for control of the background variance. Therefore, how to obtain a reasonable threshold value for the CIM and MCIM methods in the multiple-QTL situation is still an open question. However, our simulation results, especially the power of QTL detection and the probability of false QTL detection, indicate that LOD = 2.5 is a reasonable threshold value.
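Since thresholds are quoted above both on the LR scale and on the LOD scale, it may help to recall how the two are related. With LOD defined as the base-10 logarithm of the likelihood ratio and LR as twice its natural logarithm, LR = 2 ln(10) × LOD. The function names below are illustrative assumptions.

```python
import math


def lod_to_lr(lod):
    """Convert a LOD score (log10 likelihood ratio) to LR = 2 * ln(likelihood ratio)."""
    return 2.0 * math.log(10.0) * lod


def lr_to_lod(lr):
    """Inverse conversion from the LR scale to the LOD scale."""
    return lr / (2.0 * math.log(10.0))


# A LOD threshold of 2.5 expressed on the LR scale.
print(round(lod_to_lr(2.5), 2))
```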


In addition to the difficulty of choosing an appropriate threshold value, the problem of criteria (stopping rules) for model selection is critical for QTL mapping methods that use multiple intervals (multiple QTLs). The MCIM method uses a two-dimensional search strategy for mapping QTLs with epistatic effects. One problem is that there are too many possible pairwise intervals to consider, so appropriate selection methods should be used to reduce the analysis time. Another difficult question is how to decide which pairwise interval is the one we are looking for; this is the problem of the criterion for model selection. The problems of search strategy and model selection criteria are even more complicated for the MIM method, which uses multiple intervals at the same time. In this thesis, we use simulation studies to deal with these problems and propose an experimental criterion for model selection within the QTL mapping framework. This criterion is useful, as indicated by the simulation studies.

5.3 Software Design

QTL mapping methods such as MIM, MCIM, CIM, IM, and even single marker analysis are algorithmically complicated and computationally time-consuming. Appropriate computer software is therefore important for analysing the data from QTL mapping experiments, and developing good software is one of the important issues in QTL mapping work. Good QTL mapping software has several characteristics. The ability to compute quickly and accurately is the most important point for any kind of software. Beyond that, because of the complicated nature of QTL mapping in both experimental data structure and computational algorithm, a user-friendly interface is also critical for this kind of software. Another important consideration in software development is the presentation of results. Sometimes it is not sufficient to present QTL mapping results only as raw data or tables. Presenting the results graphically (visualization) is very useful and can provide more information about the experimental organism.


At present, the most popular computer programs for QTL mapping include Mapmaker/QTL (Lander et al. 1987), which is based on the IM method of Lander and Botstein, and QTL Cartographer (Basten et al. 1994), which is based on Zeng’s CIM method. These programs work very well computationally. However, their user interfaces are not very friendly, and they are sometimes difficult to use. Moreover, they lack functions for visualizing the results of QTL mapping. The results obtained by QTL Cartographer can be visualized with the general-purpose graphics software “gnuplot”, but the major drawback is that the graphics functions of “gnuplot” are limited because the software is not designed specifically for QTL mapping. This is the motivation for developing QTL mapping software with a user-friendly interface and powerful graphical presentation of the mapping results. The software is called Windows QTL Cartographer, and its functions are based on QTL Cartographer. It has many uses and has been posted on the Internet (http://statgen.ncsu.edu/qtlcart/cartographer.html).

Personal computers are becoming more popular and powerful. The 32-bit Windows operating systems, such as Windows 98 and Windows NT, are the most popular and efficient systems in the PC environment. Therefore, it is natural to develop the new version of QTL Cartographer for the PC and the Windows operating system. Several software development systems exist for the Windows environment, including Java, Visual Basic, and Visual C++. The Visual C++ system is an extension of the C/C++ language, which is the most powerful and popular programming language on the software market. Visual C++ adds thousands of functions designed for Windows programming. Because MFC encapsulates most Windows functions and can produce the skeleton program automatically, we developed the software using MFC in the Visual C++ development environment.

The functions of the software can be roughly divided into three main parts: source
data management, data analysis, and result output and visualization.

- Source Data Management


The user interface for the source data has a great influence on the use of the
software. With a user-friendly interface, users can organize and input the source
data easily and run the software very quickly.

For convenience in constructing the input file, a “new file” function has been
implemented. The function uses menus and dialog boxes to guide the user in quickly
building the input file, which is much easier than editing the raw data and the tokens.
The ‘Import’ function has also been designed for the convenience of the
user. With this function, users can load the source files of QTL mapping
experiments in MAPMAKER/QTL or QTL Cartographer format directly.

The ‘Open’ function loads the source data from an “mcd”
format file (an example is included with the software). Limited editing, such as
deleting or inserting a character, can be applied to the displayed file. If the file has been
modified, the ‘Verify’ function should be used again for error checking. The
‘Save’ function keeps the modification permanently, and the
‘Save to’ function saves the file under a different file name. The
source data can be viewed in an organized way by using the “View SrcData” function on the menu.
Through this function, the marker genotype data for each chromosome
and the trait values can be viewed in a clear layout.

- Mapping Functions

After the source data are loaded and verified, it is critical to have a good method or
algorithm to analyze the data and produce the correct result. ‘QTL Cartographer’ has
been used by many people for several years. Therefore, Windows QTL Cartographer
calls the relevant programs of ‘QTL Cartographer’
directly from inside the software to perform the calculations for the various QTL
mapping methods. By using user-friendly interfaces and dialog boxes, however, it
is easy to change the parameter settings for the various QTL mapping methods, and
the software is therefore much easier to use.

The implemented functions include Statistical Summary for the raw data, Single
Marker Analysis, Interval Mapping Method, and Composite Mapping Method. The
dialog boxes for the Interval Mapping Method and the Composite Mapping Method are very


similar; the only difference is that there is no ‘Control’ button in the dialog box
for the Interval Mapping Method. The ‘Result File’ button is used to specify the result
file name. The ‘Walk Speed’ spin control sets the QTL searching step, that is,
the distance (in cM) between two test points at which the LR score is calculated along the
whole genome. It is possible to test all chromosomes or just
one chromosome, which is chosen with the spin control. The same is true for
trait selection; only after one trait has been selected can the threshold value for that trait be
entered or calculated through a permutation test.
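
As an illustration of the ‘Walk Speed’ step and the permutation threshold described above, the following Python sketch lays out test positions along one chromosome and estimates an empirical threshold by shuffling the trait values (in the spirit of Churchill and Doerge 1994). It is not the code of Windows QTL Cartographer: the function names, the 0/1 marker coding, and the single-marker LR statistic used here as a simple stand-in for the IM/CIM scan are all assumptions made for this example.

```python
import math
import random


def scan_positions(chrom_length_cm, walk_speed_cm):
    """Test positions (cM) from 0 to the chromosome end, one every walk_speed_cm."""
    if walk_speed_cm <= 0:
        raise ValueError("walk speed must be positive")
    n_steps = int(chrom_length_cm / walk_speed_cm + 1e-9)
    positions = [i * walk_speed_cm for i in range(n_steps + 1)]
    if positions[-1] < chrom_length_cm:
        positions.append(chrom_length_cm)  # always include the chromosome end
    return positions


def marker_lr(genotypes, trait):
    """LR = n * ln(RSS0 / RSS1) at each marker, for markers coded 0/1.

    genotypes[i][m] is the (hypothetical) 0/1 code of individual i at marker m.
    RSS0 is the residual sum of squares around the overall mean, RSS1 around
    the two genotype-class means.
    """
    n = len(trait)
    mean_all = sum(trait) / n
    rss0 = sum((y - mean_all) ** 2 for y in trait)
    stats = []
    for m in range(len(genotypes[0])):
        rss1 = 0.0
        for code in (0, 1):
            ys = [y for g, y in zip(genotypes, trait) if g[m] == code]
            if ys:
                mu = sum(ys) / len(ys)
                rss1 += sum((y - mu) ** 2 for y in ys)
        if rss1 <= 0.0:
            stats.append(float("inf"))  # perfect fit: no residual variation
        else:
            # rss0 >= rss1 in exact arithmetic; max() guards against rounding.
            stats.append(n * math.log(max(rss0, rss1) / rss1))
    return stats


def permutation_threshold(genotypes, trait, scan_stat,
                          n_perm=1000, alpha=0.05, seed=0):
    """Empirical threshold: the (1 - alpha) quantile of the genome-wide maximum
    statistic over n_perm random shuffles of the trait values."""
    rng = random.Random(seed)
    shuffled = list(trait)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait association
        maxima.append(max(scan_stat(genotypes, shuffled)))
    maxima.sort()
    index = min(n_perm - 1, int((1.0 - alpha) * n_perm))
    return maxima[index]
```

For example, `scan_positions(10, 2)` gives test points at 0, 2, 4, 6, 8 and 10 cM. Taking the maximum statistic over the whole scan within each shuffle is what makes the resulting threshold genome-wide rather than point-wise.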

When all the parameters have been set, the QTL mapping analysis is started by
clicking the ‘OK’ button. During the analysis, several DOS windows may
pop up because the programs of the QTL Cartographer software are called directly.
The best way to deal with them is to minimize these windows by clicking the minimize
button in the title bar. The result file is loaded automatically into the software,
and the corresponding graph is shown in the graphic dialog box immediately.
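
The call-and-reload pattern described above (run a command-line program, wait for it to finish, then read the result file it wrote) can be sketched in Python as follows. This only illustrates the pattern: the function name `run_and_load`, the command list, and the result path are hypothetical, and no actual QTL Cartographer command line is shown.

```python
import subprocess


def run_and_load(command, result_path):
    """Run an external program and return the text of the result file it wrote.

    command is a list such as [program, arg1, arg2, ...]; both the command and
    result_path are supplied by the caller (hypothetical values in this sketch).
    """
    # Wait for the external program to finish; its console output is captured.
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(
            f"command failed ({completed.returncode}): {completed.stderr.strip()}"
        )
    # Read the result file only after the program has exited successfully.
    with open(result_path, encoding="utf-8") as handle:
        return handle.read()
```

Checking the exit code before reading the file avoids loading a stale or partially written result when the external program fails.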

- Result Visualization

The result file is a text file with the file name extension “qrt”, and it can
also be opened as a graphic view file by Windows QTL
Cartographer. Many functions are available through a menu in the graphic view
window, including File, Chrom, Traits, Effect, and Setting. With the
‘File’ item, the user can open a new result file, copy the graphic image to the clipboard,
and exit the view window. The ‘Chrom’ and ‘Traits’ items are used to select
one or several chromosomes and traits for display. The estimates of QTL additive
and dominance effects can be displayed with the ‘Effect’ item. For simulation studies,
the QTL effects and positions can also be shown by opening a ‘qtl’ file. Through the
‘Setting’ item, the user can adjust various properties of the graph, such as changing the
color, setting the threshold value, showing marker names, and drawing the
coordinates for various positions.
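
To make concrete what the threshold line in the graph separates, the short Python sketch below scans an LR profile and reports each contiguous region at or above a threshold together with its peak. The function names and the example numbers are made up for illustration and are not taken from Windows QTL Cartographer; the LR-to-LOD conversion uses the standard relation LOD = LR / (2 ln 10).

```python
import math


def lr_to_lod(lr):
    """Convert a likelihood-ratio statistic to a LOD score: LOD = LR / (2 ln 10)."""
    return lr / (2.0 * math.log(10.0))


def _summarize(points):
    """Summarize one run of (position, LR) points that lie above the threshold."""
    peak_pos, peak_lr = max(points, key=lambda p: p[1])
    return {"start": points[0][0], "end": points[-1][0],
            "peak_pos": peak_pos, "peak_lr": peak_lr}


def significant_regions(positions, lr_scores, threshold):
    """Return contiguous runs of test positions whose LR is >= threshold."""
    regions = []
    current = []
    for pos, lr in zip(positions, lr_scores):
        if lr >= threshold:
            current.append((pos, lr))
        elif current:
            regions.append(_summarize(current))
            current = []
    if current:  # a run that reaches the end of the chromosome
        regions.append(_summarize(current))
    return regions


# Made-up LR profile at 2 cM steps along one chromosome.
example = significant_regions([0, 2, 4, 6, 8, 10],
                              [1.0, 12.0, 15.0, 3.0, 11.5, 2.0],
                              threshold=11.5)
```

With these made-up numbers, `example` holds two regions: one from 2 to 4 cM with its peak at 4 cM, and one consisting of the single test point at 8 cM.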


References

[1] Abler, B. S. B., M. D. Edwards, C. W. Stuber 1991. Isoenzymatic identification

of quantitative trait loci in crosses of elite maize inbreds. Crop Science 31:

267-274.

[2] Andersson, L. et al. 1994. Genetic mapping of quantitative trait loci for growth

and fatness in pigs. Science 263: 1771-1774.

[3] Aldhous, P. 1994. Fast tracks to disease genes. Science 265: 2008-2010.

[4] Baes P., and P. Van Cutsem 1993. Electrophoretic analysis of eleven isozyme

systems and their possible use as biochemical markers in breeding of Chicory

cichorium-intybus L. Plant Breeding 110: 16-23.

[5] Basten, C. J., Z.-B. Zeng and B. S. Weir 1994. Zmap – A QTL cartographer.

Proceedings of the 5th World Congress on Genetics Applied to Livestock Production,

22: 65-66.

[6] Beckmann, J. S. and M. Soller 1983. Restriction fragment length polymorphisms

in genetic improvement: methodologies, mapping and cost. Theor. Appl. Genet.

67: 35-43

[7] Beckmann, J. S. and M. Soller 1986a. Restriction fragment length polymorphisms

and genetic improvement in agricultural species. Euphytica 35: 111-124.

[8] Beckmann, J. S. and M. Soller 1986b. Restriction fragment length

polymorphisms in plant genetic improvement. Oxford Surveys Plant Mol. Cell

Biol. 3: 196-250.

[9] Botstein, D., R. L. White, M. Skolnick, and R. W. Davis 1980. Construction of a

genetic linkage map in man using restriction fragment length polymorphisms.

Am. J. Hum. Genet. 32: 314-331.

[10] Bradshaw, H. D. and R. F. Stettler 1995. Molecular genetics of growth and

development in Populus. IV. Mapping QTLs with large effects on growth, form,

and phenology traits in a forest tree. Genetics 139: 963-973.


[11] Breiman, L. 1995. Better Subset Regression Using the Nonnegative Garrote.

Technometrics 37: 373-384.

[12] Burr, B. F., A. Burr, K. H. Thompson, M. C. Albertson and C. Stuber 1988.

Gene mapping with recombinant inbreds in maize. Genetics 118: 519-526.

[13] Buetow, K. H. and A. Chakravarti 1987a. Multipoint gene mapping using

seriation. I. General methods. American Journal of Human Genetics 41:

180-188.

[14] Buetow, K. H. and A. Chakravarti 1987b. Multipoint gene mapping using

seriation. II. Analysis of simulated and empirical data. American Journal of

Human Genetics 41: 189-201.

[15] Causse, M. A., T. M. Fulton, Y. G. Cho, S. N. Ahn, J. Chunwongse, et al. 1994.

Saturated molecular map of the rice genome based on an interspecific backcross

population. Genetics 138: 1251-1274.

[16] Chase, K., F. R. Adler, K. G. Lark 1997. Epistat: A Computer Program for

Identifying and Testing Interactions Between Pairs Of Quantitative Trait Loci.

Theor. Appl. Genet. 94: 724-730.

[17] Churchill, G. A., and R. W. Doerge 1994. Empirical threshold values for

quantitative trait mapping. Genetics 138: 963-971.

[18] Cockerham, C. C. 1954. An extension of the concept of partitioning hereditary

variance for analysis of covariances among relatives when epistasis is present.

Genetics 39: 859-882.

[19] Comstock, R. E. 1978. Quantitative genetics in maize breeding. pp 191-206. In

maize breeding and genetics, New York.

[20] Delourme, R., and F. Eber 1992. Linkage between an isozyme marker and a

restorer gene in radish cytoplasmic male sterility of rapeseed (Brassica napus L.).

Theor. Appl. Genet. 85: 222-228.

[21] Doerge, R. W. 1993. Statistical methods for locating quantitative trait loci with

molecular markers. Ph. D. dissertation, Dept. Statistics, NCSU, Raleigh.

[22] Doerge, R. W. and G. A. Churchill 1995. Permutation tests for multiple loci

affecting a quantitative character. Genetics 142: 285-294.


[23] Donis-Keller, H. et al. 1987. A genetic linkage map of the human genome. Cell

51: 319-337.

[24] East, E. M. 1916. Studies on size inheritance in Nicotiana. Genetics 1: 164-176

[25] Eberhart, S. A., R. H. Moll, H. F. Robinson and C. C. Cockerham. 1966.

Epistatic and other genetic variances in two varieties of Maize. Crop Science 6:

275-280.

[26] Edwards, M. D., C. W. Stuber, J. F. Wendel 1987. Molecular-marker-facilitated

investigations of quantitative trait loci in maize. I. Numbers, genomic

distribution and types of gene action. Genetics 116: 113-125.

[27] Falconer, D. S. 1996. Introduction to quantitative genetics. Ed. 4. Longman.

New York.

[28] Frank, I., Friedman 1993. A Statistical View of Some Chemometrics

Regression Tools. Technometrics 35: 109-135.

[29] Frankel, W. N. 1995. Taking stock of complex trait genetics in mice. Trends

Genet. 11: 471-477.

[30] Gai J.-Y., J.-K. Wang 1998. Identification and estimation of a QTL model and

its effects. Theor Appl Genet 97: 1162-1168.

[31] Goffinet B., B. Mangin 1998. Comparing methods to detect more than one QTL

on a chromosome. Theor Appl Genet 96: 628-633.

[32] Haldane, J. B. S. 1919. The combination of linkage values and the calculation of

distance between the loci of linked factors. Journal of Genetics 8: 299-309.

[33] Haley, C. S., S. A. Knott 1992. A Simple Regression Method for Mapping

Quantitative Trait Loci in Line Crosses Using flanking Markers. Heredity 69:

315-324.

[34] Hallden, C., A. Hjerdin, I. M. Rading, T. Sall, and B. Fridlundh, et al. 1996. A

high density RFLP linkage map of sugar beet. Genome 39: 634-645.

[35] Halward, T., H. T. Stalker and G. Kocher 1993. Development of an RFLP

linkage map in diploid peanut species. Theor. Appl. Genet. 87: 379-384.


[36] Hamalainen, J. H., K. N. Watanabe, J. P. T. Valkonen, A. Arihara, R. L. Plaisted,

et al. 1997. Mapping and marker assisted selection for a gene for extreme

resistance to potato virus Y. Theor. Appl. Genet. 94: 192-197.

[37] Hartley, H. O. and J. N. K. Rao 1967. Maximum-likelihood estimation for the

mixed analysis of variance model. Biometrika, 54: 93-108.

[38] Hoeschele, I., P. Uimari, F. E. Grignola, Q. Zhang and K. M. Gage 1997.

Advances in statistical methods to map quantitative trait loci in outbred

populations. Genetics 147: 1445-1457.

[39] Jansen, R. C. 1992. A general mixture model for mapping quantitative trait loci

by using molecular markers. Theor. Appl. Genet. 85: 252-260.

[40] Jansen, R. C. 1993. Interval mapping of multiple quantitative trait loci. Genetics

135: 205-211.

[41] Jansen, R. C. 1994. Controlling the Type I and Type II Errors in Mapping

Quantitative Trait Loci. Genetics 138: 871-881.

[42] Jansen R. C., D. L. Johnson., J. A. M. Van Arendonk 1998. A Mixture Model

Approach to the Mapping of Quantitative Trait Loci in Complex Populations

With an Application to Multiple Cattle Families. Genetics 148: 391-399.

[43] Jiang, C.-J., Z.-B. Zeng, 1995. Multiple Trait Analysis of Genetic Mapping for

Quantitative Trait Loci. Genetics 140: 1111-1127.

[44] Jiang, C.-J., Z.-B. Zeng, 1997. Mapping Quantitative Trait Loci With

Dominant And Missing Markers In Various Crosses From Two Inbred Lines.

Genetica 101: 47-58.

[45] Johannsen, W. 1909. Elemente der exakten Erblichkeitslehre. Fischer, Jena.

[46] Josee D. and D. Siegmund 1999. Statistical Method for Mapping Quantitative

Trait Loci From a Dense Set of Markers. Genetics 151: 373-386.

[47] Kao, C.-H., Z.-B. Zeng 1997. General Formulas For Obtaining The MLEs

And The Asymptotic Variance-Covariance Matrix In Mapping Quantitative

Trait Loci When Using The EM Algorithm. Biometrics 53: 653-665.

[48] Kao, C.-H., Z.-B. Zeng, and R. Teasdale 1999. Multiple interval mapping for

quantitative trait loci. Genetics 152: 1203-1216.


[49] Kindiger B., and R. A. Vierling 1994. Comparative isozyme polymorphisms of

North American eastern gamagrass, Tripsacum dactyloides var. dactyloides and

maize, Zea mays L. Genetica 94: 77-83.

[50] Knott, S. A., C. S. Haley, 1992. Aspects of maximum likelihood methods for the

mapping of quantitative trait loci in line crosses. Genet. Res. 60: 139-151.

[51] Kundu, D., G. Murali, 1996. Model Selection in Linear Regression.

Computational Statistics & Data Analysis 22: 461-469.

[52] Lagercrantz, U., and D. J. Lydiate 1996. Comparative genome mapping in

Brassica. Genetics 144: 1903-1910.

[53] Lander, E. S., P. Green, I. Abrahamson, A. Barlow, M. J. Daly, et al. 1987.

Mapmaker: an interactive computer package for constructing primary genetic

linkage maps of experimental populations. Genomics 1: 182-195.

[54] Lander, E. S., D. Botstein. 1989. Mapping Mendelian factors underlying

quantitative traits using RFLP linkage maps. Genetics 121: 185-199.

[55] Lander, E. S. 1993. Finding similarities and differences among genomes. Nature

Genetics 4: 5-6.

[56] Lee, G. H., L. M. Bennett, R. A. Carabeo, and N. R. Drinkwater 1995.

Identification of Hepatocarcinogen-resistance genes in DBA/2 mice. Genetics

139: 387-395.

[57] Lee, M. 1995. DNA markers and plant breeding programs. Adv. Agron. 55:

265-344

[58] Li, Z. K., S. R. M. Pinson, M. A. Marchetti, J. W. Stansel and W. D. Park 1995.

Characterization of quantitative trait loci contributing to field resistance to

sheath blight (Rhizoctonia solani) in rice. Theor. Appl. Genet. 91: 382-388.

[59] Liu, B. H. and S. J. Knapp 1992. QTLSTAT 1.0, a software for mapping

complex trait using nonlinear models. Oregon State University.

[60] Liu, B. H. and S. J. Knapp 1997. Computational tools for study of complex traits,

pp. 43-79 in Molecular dissection of complex traits. edited by A. H. Paterson,

CRC Press LLC, Boca Raton, Florida.


[61] Long, A. D., S. L. Mullaney, L. A. Reid, J. D. Fry, C. H. Langley and T. F. C.

Mackay 1995. High resolution mapping of genetic factors affecting abdominal

bristle number in Drosophila melanogaster. Genetics 139: 1273-1291.

[62] Lu, Y. Y. and B. H. Liu 1995. PGRI, a software for plant genome research.

Plant Genome III conference (Abstract): 105, San Diego, CA.

[63] Lynch, M., B. Walsh, 1998. Genetics and Analysis of Quantitative Traits.

Massachusetts: Sinauer Associates, Inc.

[64] Mallows, C. L. 1995. More Comments on Cp. Technometrics, 37: 362-372.

[65] Mangin, B., B. Goffinet and A. Rebai 1994. Constructing Confidence Intervals

for QTL Location. Genetics 138: 1301-1308.

[66] Manly, K. F. and E. H. Cudmore, Jr. 1996. New version of MAP manager

genetic mapping software. Plant Genome IV (Abstract): 105.

[67] Martinez, O., R. N. Curnow 1992. Estimating the Locations and the Sizes of

the Effects of Quantitative Trait Loci Using Flanking Markers. Theor. Appl.

Genet. 85: 480-488.

[68] Nienhuis, J., T. Helentjaris, M. Slocum, B. Ruggero and A. Schaefer 1987.

Restriction fragment length polymorphism analysis of loci associated with insect

resistance in tomato. Crop Science 27: 791-803.

[69] Nilsson-Ehle, H. 1909. Kreuzungsuntersuchungen an Hafer und Weizen. Lund.

[70] Paterson, A. H., S. Damon, J. D. Hewitt, D. Zamir, H. D. Rabinowitch, S. E.

Lincoln, E. S. Lander and S. D. Tanksley 1991. Mendelian factors underlying quantitative

traits in tomato: comparison across species, generations, and environments.

Genetics 127: 181-197

[71] Plomin, R., G. E. McClearn and G. Gora-Maslak 1991. Quantitative trait loci

and psychopharmacology. Journal of Psychopharmacology 5: 1-9.

[72] Ragot, M., and D. A. Hoisington 1993. Molecular markers for plant breeding:

comparisons of RFLP and RAPD genotyping costs. Theor. Appl. Genet. 86:

957-984.

[73] Rao, C. R 1971. Estimation of variance and covariance components MINQUE

theory. Journal of multivariate analysis 1: 257-275.


[74] Rao, P. S. R. S. 1997. Variance components estimation: mixed models,

methodologies and applications (Monographs on statistics and applied

probability 78). Chapman and Hall, London.

[75] Rasmusson, J. M. 1933. A contribution to the theory of quantitative character

inheritance. Hereditas 18: 245-261.

[76] Rebai, A., B. Goffinet and B. Mangin 1994. Approximate thresholds of interval

mapping test for QTL detection. Genetics 138: 235-240.

[77] Rebai, A., B. Goffinet and B. Mangin 1995. Comparing power of different

methods for QTL detection. Biometrics 51: 87-99.

[78] Reiter, R. S., J. G. Cors, M. R. Sussman and W. H. Gabelman 1991. Genetic

analysis of tolerance to low-phosphorus stress in maize using restriction

fragment length polymorphisms. Theor. Appl. Genet. 82: 561-568.

[79] Reiter, R. S., J. G. K. Williams, K. A. Feldmann, J. A. Rafalski, S. V. Tingey and

P. A. Scolnik 1992. Global and local genome mapping in Arabidopsis thaliana

by using recombinant inbred lines and random amplified polymorphic DNAs.

Proc. Natl. Acad. Sci. USA 89: 1477-1481.

[80] Sax, K. 1923. The association of size differences with seed-coat pattern and

pigmentation in Phaseolus vulgaris. Journal of Theoretical Biology 117: 1-10.

[81] Searle S. R. 1970. Large sample variances of maximum likelihood estimators of

variance components using unbalanced data. Biometrics 26: 505-524.

[82] Shao, J. 1996. Bootstrap Model Selection. Journal of the American Statistical

Association 91: 655-665.

[83] Sillanpaa M. J., E. Arjas 1998. Bayesian Mapping of Multiple Quantitative Trait

Loci From Incomplete Inbred Line Cross Data. Genetics 148: 1373-1388.

[84] Simon, C. J., and F. J. Muehlbauer 1997. Construction of a chickpea linkage

map and its comparison with maps of a pea and lentil. Journal of Heredity 88:

115-119.

[85] Soller, M., and J. S. Beckmann 1988. Genomic genetics and the utilization for

breeding purposes of genetic variation between populations. The second


international conference on quantitative genetics, pp 161-188. Sinauer Assoc.,

Sunderland, MA.

[86] Stuber, C. W., M. D. Edwards, J. F. Wendel 1987. Molecular marker facilitated

investigations of quantitative trait loci in maize II. Factors influencing yield and

its component traits. Crop Science 27: 639-648.

[87] Stuber, C. W., S. E. Lincoln, D. W. Wolff, T. Helentjaris, and E. S. Lander 1992.

Identification of genetic factors contributing to heterosis in a hybrid from two

elite maize inbred lines using molecular markers. Genetics 132: 823-839.

[88] Tanksley, S. D., and C. M. Rick 1980. Isozymic gene linkage map of the tomato:

Applications in genetic and breeding. Theor. Appl. Genet. 57: 161-170.

[92] Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal

of the Royal Statistical Society Series B, 58: 267-288.

[89] Thoday, J. M. 1961. Location of polygenes. Nature 191: 368-379.

[90] Thompson, E. A. 1984. Information gain in joint linkage analysis. IMA Journal

of Mathematical Applied Medical Biology 1: 31-49.

[91] Thompson, E. A., T. R. Meagher 1998. Genetic linkage in the estimation of

pairwise relationship. Theor Appl Genet 97: 857-864.

[93] Van Ooijen, J. W. and C. Maliepaard 1996. MAPQTL version 3.0: Software for

the calculation of QTL position on genetic map. Plant Genome IV (Abstract):

105.

[94] Viruel, M. A., R. Messeguer, M. C. De Vicente, J. Garcia Mas, P.

Puigdomenech, et al. 1995. A linkage map with RFLP and isozyme markers for

almond. Theor. Appl. Genet. 91: 964-971.

[95] Visscher P. M., R. Thompson and C. S. Haley 1996. Confidence intervals in

QTL mapping by bootstrapping. Genetics 143: 1013-1020.

[96] Wang D., J. Zhu, Z. Li, A. H. Paterson 1999. Mapping QTLs with epistatic

effects and QTL × environment interactions by mixed linear model approaches.

Theor. Appl. Genet. 99: 1255-1264.


[97] Wang S. and X. Yan 1998. Simulation data generation method and software

design for mixed linear model. Journal of Zhejiang Agricultural University

24(2):135-140

[98] Wang, S., C. Basten, and Z.-B. Zeng 1999. Windows QTL Cartographer.

Department of Statistics, North Carolina State University, Raleigh, NC.

(http://statgen.ncsu.edu/qtlcart/WQTLCart.htm).

[99] Weeks, D. and K. Lange 1987. Preliminary ranking procedures for multilocus

ordering. Genomics 1: 236-242.

[100] Weller, J. I. 1996. Maximum likelihood techniques for the mapping and

analysis of quantitative trait loci with the aid of genetic marker. Biometrics 42:

627-640.

[101] Williams, J. G. K. et al. 1990. DNA polymorphisms amplified by arbitrary

primers are useful as genetic markers. Nucl. Acids Res. 18: 6531-6535.

[102] Wu W.-R., W. M. Li 1995. Model Fitting and Model Testing in the Method

of Joint Mapping of Quantitative Trait Loci. TAG 92: 477-482.

[106] Zeng, Z.-B. 1993. Theoretical Basis Of Separation Of Multiple Linked

Gene Effects On Mapping Quantitative Trait Loci. Proc. Natl. Acad. Sci. USA

90: 10972-10976.

[103] Xu, G. W., C. W. Magill, K. F. Schertz and G. E. Hart 1994. A RFLP linkage

map of Sorghum bicolor (L.) Moench. Theor. Appl. Genet. 89: 139-145.

[104] Xu, Y. B., and L. H. Zhu 1994. Molecular quantitative genetics. China

Agriculture Press, Beijing.

[105] Zeng, Z.-B. 1992. Correcting the Bias of WRIGHT’s Estimates of the

Number of Genes Affecting a Quantitative Character: A Further Improved

Method. Genetics 131: 987-1001.

[107] Zeng, Z.-B. 1994. Precision Mapping of Quantitative Trait Loci. Genetics

136: 1457-1468.

[108] Zhu, J. and B. S. Weir 1996. Diallel analysis for sex-linked and maternal

effects. Theor. Appl. Genet. 92: 1-9.


[109] Zhu J 1998. Mixed model approaches for mapping complex quantitative trait

loci, pp. 11-20, in Proceedings of the China National Conference on Plant

Breeding, edited by L. Z. Wang and J. R. Dai, Agricultural Science and

Technology Press of China, Beijing, China.

[110] Zhu J., and B. S. Weir 1998. Mixed model approaches for genetic analysis of

quantitative traits. pp. 321-330, In Proceedings of the International Conference

on Mathematical Biology, edited by L. S. Chen, S. G. Ruan, and J. Zhu, World

Scientific Publishing Co., Singapore.

[111] Zhu J 1999. Mixed model approaches of mapping genes for complex

quantitative traits. Journal of Zhejiang University (Natural Science), 33(3):

327-335

[112] Zhu J 1999. Principle of Linear Model Analysis. Scientific Publishing House,

Beijing, China