GENE-ASSOCIATED SNP DISCOVERY AND MOLECULAR CLONING OF FULL-LENGTH cDNA OF CINNAMATE 4-HYDROXYLASE AND CINNAMYL ALCOHOL DEHYDROGENASE IN A TROPICAL TIMBER TREE Neolamarckia cadamba

Tchin Boon Ling

Master of Science

(Plant Biotechnology)

2013

Faculty of Resource Science and Technology

GENE-ASSOCIATED SNP DISCOVERY AND MOLECULAR CLONING OF FULL-LENGTH cDNA OF CINNAMATE 4-HYDROXYLASE AND CINNAMYL ALCOHOL DEHYDROGENASE IN A TROPICAL TIMBER TREE Neolamarckia cadamba

Tchin Boon Ling

A thesis submitted in fulfilment of the requirements for the degree of Master of Science

Faculty of Resource Science and Technology

UNIVERSITI MALAYSIA SARAWAK

2013

ACKNOWLEDGEMENTS

First, I would like to express my sincere appreciation and deepest gratitude to my obliging supervisor, Dr. Ho Wei Seng, Faculty of Resource Science and Technology, UNIMAS, who gave me this golden opportunity to carry out my master's research in the Forest Genomics and Informatics Laboratory (fGiL). I thank him for his patience, valuable advice and encouragement, which enabled me to complete my study successfully.

I would also like to thank my co-supervisor, Dr. Pang Shek Ling, researcher from Sarawak Forestry Corporation (SFC), for her continuous guidance throughout the research and for giving me many insights into research techniques and the problems that arise.

My sincere thanks also go to Assoc. Prof. Dr. Ismail Jusoh, Department of Plant Science and Environmental Ecology, Faculty of Resource Science and Technology, UNIMAS, who provided me with the necessary information and techniques for the wood properties study.

My sincere gratitude goes to our lab assistant, Miss Kamalia, and all labmates in the Forest Genomics and Informatics Laboratory (fGiL), UNIMAS, especially Mr. Liew Kit Siong for his care, advice and support. Without his assistance, it would have been impossible to complete my study within the stipulated period.

Finally, I would like to thank my family, friends and everyone else who directly or indirectly contributed to my study, especially my lovely family members, who always cared for and supported me throughout my study in UNIMAS.

ABSTRACT

Neolamarckia cadamba, locally known as Kelampayan, is one of the indigenous tree species selected for forest plantation establishment in Sarawak owing to its high productivity and short rotation time. Understanding the structure and composition of Kelampayan wood through genome integration is vital to better utilize this wood material. Concurrently, the Kelampayan wood formation genomic resource database, known as Cadamomics (10,368 ESTs), has been developed, and this opens the gateway for researchers to explore the genomic basis of Kelampayan in depth. An EST database is a useful resource for gene discovery. Further analysis of this database generated two full-length lignin biosynthesis genes, namely C4H and CAD. Validation by RT-PCR with full-length gene-specific primers confirmed the identities of the genes discovered. The full-length C4H cDNA, designated NcC4H, is 1,651 bp long, with a 1,518 bp open reading frame encoding a protein of 505 amino acids, an 18 bp 5'-UTR and a 115 bp 3'-UTR. NcC4H showed higher identity with the class I C4Hs, which are preferentially involved in the phenylpropanoid biosynthesis pathway. Meanwhile, sequence analysis of the full-length CAD cDNA, designated NcCAD, showed that it is 1,240 bp long, with a 1,086 bp open reading frame encoding a protein of 361 amino acids, a 68 bp 5'-UTR and an 86 bp 3'-UTR. Phylogenetic analysis revealed that NcCAD was grouped in the cluster containing both CAD and SAD genes, both of which are involved in lignin biosynthesis. The full-length NcC4H and NcCAD cDNAs identified serve as good candidate genes for association genetics studies in Kelampayan. Association genetics is a powerful approach for detecting potential genetic variants, i.e. SNPs, underlying common and complex adaptive traits. Thus far, single nucleotide polymorphisms (SNPs) detected in C4H and CAD genes are known to be correlated with other phenotypic variations in forest tree species, not with lignin production alone. Hence, attempts were made to discover SNPs from the partial C4H and CAD genomic DNA sequences. Overlapping primers were designed to amplify the partial C4H and CAD DNA from 12 Kelampayan samples. The amplified DNA fragments were cloned and sent for sequencing. In addition, wood cores were collected and the basic wood density was measured for each tree. Sequence variation analysis revealed 60 and 32 SNPs in the partial C4H and CAD DNA sequences, respectively. The SNPs detected were distributed throughout the exon, intron, 5'-UTR and 3'-UTR regions. Among the SNPs detected in the exon regions of C4H, 16 were synonymous mutations and eight were nonsynonymous mutations. For CAD, six SNPs led to synonymous mutations and one SNP led to a nonsynonymous mutation. Taking the C4H and CAD genes together, synonymous mutations (71 %) were more common than nonsynonymous mutations (29 %). Association genetics analysis also revealed that four and six SNPs from the C4H and CAD genes, respectively, were significantly associated with the basic wood density of Kelampayan (p<0.05). Genetic variations identified by the SNP markers, once validated, will facilitate the selection of Kelampayan parental lines or seedlings with optimal quality through the gene-assisted selection (GAS) approach.

Keywords: Neolamarckia cadamba, Cinnamate 4-hydroxylase, Cinnamyl alcohol

dehydrogenase, Single nucleotide polymorphism, Association genetics study.
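
The sequence lengths and mutation counts reported in the abstract can be cross-checked with simple arithmetic. The following minimal Python sketch is illustrative only and is not part of the thesis workflow. It assumes that each reported open reading frame length includes one stop codon, and the names `cdna_is_consistent` and `pooled_percentages` are hypothetical helpers introduced for this check.

```python
# Figures quoted in the abstract; no new data are introduced here.
CDNA = {
    "NcC4H": {"total": 1651, "orf": 1518, "utr5": 18, "utr3": 115, "aa": 505},
    "NcCAD": {"total": 1240, "orf": 1086, "utr5": 68, "utr3": 86, "aa": 361},
}
EXON_SNPS = {
    "C4H": {"syn": 16, "nonsyn": 8},
    "CAD": {"syn": 6, "nonsyn": 1},
}


def cdna_is_consistent(rec):
    """Check that 5'-UTR + ORF + 3'-UTR equals the cDNA length and that the
    ORF equals three bases per amino acid plus one stop codon (assumption)."""
    parts_ok = rec["utr5"] + rec["orf"] + rec["utr3"] == rec["total"]
    orf_ok = rec["orf"] == 3 * (rec["aa"] + 1)
    return parts_ok and orf_ok


def pooled_percentages(snps):
    """Return (synonymous %, nonsynonymous %) pooled over all genes."""
    syn = sum(gene["syn"] for gene in snps.values())
    nonsyn = sum(gene["nonsyn"] for gene in snps.values())
    total = syn + nonsyn
    return round(100 * syn / total), round(100 * nonsyn / total)


if __name__ == "__main__":
    for name, rec in CDNA.items():
        print(name, "consistent:", cdna_is_consistent(rec))
    print("synonymous %, nonsynonymous %:", pooled_percentages(EXON_SNPS))
```

Both cDNA records pass the check, and the pooled exon counts (22 synonymous and 9 nonsynonymous SNPs across the two genes) reproduce the 71 % and 29 % figures.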

GENE-ASSOCIATED SNP DISCOVERY AND CLONING OF FULL-LENGTH CINNAMATE 4-HYDROXYLASE AND CINNAMYL ALCOHOL DEHYDROGENASE cDNA FROM THE TROPICAL TREE Neolamarckia cadamba

ABSTRAK (English translation of the Malay abstract)

Neolamarckia cadamba (local name: Kelampayan) is a native tree selected for forest plantation establishment in Sarawak because of its high productivity and short rotation time. An understanding of the structure and composition of Kelampayan wood through genome integration is important. Concurrently, a genomic resource database for Kelampayan wood formation (Cadamomics, 10,368 ESTs) has been developed, enabling researchers to explore the genomic basis of Kelampayan in depth. An EST database is a very useful resource for gene discovery. Further analysis of this database yielded two full-length lignin biosynthesis genes, namely C4H and CAD. Validation by RT-PCR and gene-specific primers confirmed the identities of the genes discovered. The full-length C4H cDNA, or NcC4H, is 1,651 bp long, with a 1,518 bp open reading frame encoding 505 amino acids, an 18 bp 5'-UTR and a 115 bp 3'-UTR. NcC4H is categorized among the class I C4Hs, which are mainly involved in phenylpropanoid biosynthesis. Meanwhile, analysis of the full-length CAD cDNA, or NcCAD, showed that it is 1,240 bp long, with a 1,086 bp open reading frame encoding 361 amino acids, a 68 bp 5'-UTR and an 86 bp 3'-UTR. Phylogenetic analysis showed that NcCAD falls within the cluster containing both CAD and SAD genes, both of which are involved in lignin biosynthesis. The NcC4H and NcCAD cDNAs can be used as good candidate genes for association genetics studies in Kelampayan. Association genetics is a powerful approach for detecting genetic variation, for example single nucleotide polymorphisms (SNPs), that potentially underlies the common and complex adaptive traits of Kelampayan. To date, SNPs detected in the C4H and CAD genes are known to be correlated with several other phenotypic variations in forest tree species, not with lignin production alone. Therefore, attempts were made to find SNPs in the partial C4H and CAD DNA sequences. Overlapping primers were designed to amplify C4H and CAD DNA from 12 Kelampayan samples. The amplified DNA fragments were cloned and sent for sequencing. In addition, wood cores were collected and the basic wood density was measured for each tree. Sequence variation analysis detected 60 and 32 SNPs in the partial C4H and CAD DNA, respectively. The SNPs detected were located in the exon, intron, 5'-UTR and 3'-UTR regions. Among the SNPs detected in the exon regions of C4H, 16 were synonymous mutations and 8 were nonsynonymous mutations. For CAD, 6 SNPs led to synonymous mutations and one SNP led to a nonsynonymous mutation in the translated amino acid sequence. Synonymous mutations (71%) occurred more frequently than nonsynonymous mutations (29%) for the C4H and CAD genes. Association genetics analysis also showed that 4 and 6 SNPs from the C4H and CAD genes, respectively, were significantly associated with the wood density of Kelampayan (p<0.05). The genetic variation identified by the SNP markers, once validated, will facilitate the selection of Kelampayan parents or seedlings of optimum quality through the gene-assisted selection (GAS) method.

Keywords: Neolamarckia cadamba, Cinnamate 4-hydroxylase, Cinnamyl alcohol dehydrogenase, Single nucleotide polymorphism (SNP), Association genetics study

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER I INTRODUCTION

CHAPTER II LITERATURE REVIEW
2.1 Neolamarckia cadamba (Roxb.) Bosser
2.1.1 Anatomy Characteristics
2.1.2 Scientific Classification
2.1.3 Potential Uses of Kelampayan
2.1.4 Pharmacological Values of Kelampayan
2.1.5 Genomics Study of Kelampayan
2.2 Wood
2.2.1 Wood Density
2.2.2 Wood Formation
2.2.3 Plant Cell Wall
2.2.4 Cellulose
2.2.5 Hemicelluloses
2.2.6 Lignin
2.2.7 Lignin Biosynthesis Pathway
2.2.7.1 Cinnamate 4-Hydroxylase (C4H) Gene
2.2.7.2 Cinnamyl Alcohol Dehydrogenase (CAD) Gene
2.3 Molecular Markers for Forest Tree Genomics Research
2.3.1 Single Nucleotide Polymorphisms (SNPs)
2.3.1.1 Advantages and Disadvantages of SNPs
2.3.1.2 SNP Markers Development
2.3.1.3 Applications of SNPs in Genome Analysis
2.4 Association Genetics Study

CHAPTER III MATERIALS AND METHODS
3.1 C4H and CAD EST Data Analysis
3.2 Primer Design for Full-length C4H and CAD cDNA
3.3 Total RNA Isolation
3.3.1 Plant Materials
3.3.2 Total RNA Isolation Protocol
3.4 Reverse Transcription of Total RNA
3.5 Cloning and Sequencing of Full-length C4H and CAD cDNA
3.5.1 Rapid Amplification of Full-length C4H and CAD cDNA
3.5.2 Purification of PCR Amplicons from Agarose Gel
3.5.3 cDNA Ligation
3.5.4 Transformation
3.5.5 Blue/White Colony Screening
3.5.6 Plasmids Isolation
3.5.7 Confirmation for Desired Insert Through Restriction Enzyme Digestion
3.5.8 Sequencing
3.6 In Silico Analysis of Full-length C4H and CAD cDNA
3.7 Sequence Variations of C4H and CAD Genes
3.7.1 Plant Materials
3.7.2 DNA Extraction
3.7.3 DNA Purification
3.7.4 Overlapping Primer Design
3.7.5 PCR Amplification
3.7.6 Cloning of PCR Amplicons
3.7.7 Sequencing
3.7.8 Sequence Variation Analysis
3.8 Wood Properties
3.8.1 Wood Cores Collection
3.8.2 Wood Density Measurement
3.9 Statistical Analysis
3.9.1 Nucleotide Diversity Analysis
3.9.2 Association Analysis
3.9.3 In Silico Development of CAPs from SNP Genotypes

CHAPTER IV RESULTS AND DISCUSSION
4.1 C4H and CAD EST Data Analysis
4.2 Total DNA and RNA Extraction
4.2.1 Total RNA Extraction
4.2.2 Total DNA Extraction
4.2.2.1 DNA Qualification and Quantification
4.3 Full-length C4H and CAD cDNA Discovery and Analysis
4.3.1 Cinnamate 4-hydroxylase (C4H)
4.3.1.1 NcC4H cDNA Sequence
4.3.1.2 C4H Genomic DNA Sequence
4.3.1.3 In Silico Analysis of NcC4H cDNA Sequence
4.3.2 Cinnamyl alcohol dehydrogenase (CAD)
4.3.2.1 Full-Length NcCAD cDNA Sequence
4.3.2.2 CAD Genomic DNA Sequence
4.3.2.3 In Silico Analysis of NcCAD cDNA Sequence
4.4 Sequence Variations of C4H and CAD DNA Sequences
4.4.1 Cloning of C4H and CAD DNA Fragments
4.4.2 Single Nucleotide Polymorphisms (SNPs) Analysis
4.4.3 Nucleotide Diversity Analysis
4.4.4 Phylogenetic Relationship Study Using SNP Markers
4.4.5 Linkage Disequilibrium (LD)
4.5 Basic Wood Density
4.6 Association Genetics Study
4.7 In Silico Restriction Enzymes Analysis

CHAPTER V CONCLUSIONS

REFERENCES

APPENDICES
Appendix A Repetitive sequencing results for selected samples
Appendix B Sequence alignments showing the positions of start and stop codons
Appendix C Alignment of three full-length C4H and CAD clones together with their respective hypothetical sequences
Appendix D Gel-purified PCR products of C4H and CAD genes analyzed on 1.5 % agarose gel. In total, four C4H regions and three CAD regions were amplified and purified from 12 Kelampayan samples
Appendix E SNP detection in C4H DNA sequences
Appendix F SNP detection in CAD DNA sequences
Appendix G Synonymous and nonsynonymous mutations in C4H and CAD amino acid sequences
Appendix H Shared allele distance calculated using PowerMarker software
Appendix I Neighbour joining trees constructed using shared allele distance
Appendix J LD analysis of SNPs in CAD and C4H genes using TASSEL v.3 software
Appendix K LD analysis of SNPs in C4H and CAD genes using DnaSP5 software

LIST OF TABLES

Table 2.1 The pros and cons of low-lignin wood.
Table 2.2 Various molecular markers and their characteristics.
Table 2.3 Applications and features of various molecular markers.
Table 3.1 Ligation reaction mixture and volume.
Table 3.2 Restriction digestion reaction mixture and volume.
Table 3.3 DBH and GPS readings for the 12 Kelampayan trees collected.
Table 3.4 PCR profile for each C4H and CAD primer set.
Table 4.1 BLAST result of the hypothetical C4H amino acid sequence.
Table 4.2 BLAST result of the hypothetical CAD amino acid sequence.
Table 4.3 The concentration of total RNA extracted from developing xylem tissue of Kelampayan.
Table 4.4 The concentration of total DNA extracted from inner bark tissue of Kelampayan.
Table 4.5 Full-length forward and reverse primers designed for C4H and CAD genes.
Table 4.6 The BLASTn output for the full-length NcC4H cDNA sequence discovered from Kelampayan.
Table 4.7 Intron and gene lengths of known C4H genomic DNA sequences. The gene length is calculated from start codon to stop codon.
Table 4.8 Comparison of NcC4H protein structure against structures in PDB by using Dali server.
Table 4.9 The BLASTn output for the full-length NcCAD cDNA sequence discovered from Kelampayan.
Table 4.10 Intron-exon structures of CAD genes from Populus.
Table 4.11 Comparison of NcCAD protein structure against structures in PDB by using Dali server.
Table 4.12 Primers designed to flank the C4H and CAD genomic DNA sequences.
Table 4.13 SNPs detected within partial C4H and CAD DNA sequences.
Table 4.14 Synonymous and nonsynonymous mutations in C4H and CAD genes.
Table 4.15 The distribution of SNPs in C4H and CAD DNA sequences and the resulting amino acid substitutions.
Table 4.16 Haplotype diversity (Hd), nucleotide diversity (θ and π) and neutrality test statistics (D) in C4H and CAD candidate genes.
Table 4.17 Mean nucleotide diversity (θW and π) in different nucleotide sites or gene regions for C4H and CAD candidate genes.
Table 4.18 The number of significant pairwise LD calculated using Chi-square Test in DnaSP5 software.
Table 4.19 Estimate of the recombination parameter in the history of the C4H and CAD loci.
Table 4.20 Basic wood density measured for 12 Kelampayan samples.
Table 4.21 Association test for SNPs from C4H genes with basic wood density.
Table 4.22 Association test for SNPs from CAD genes with basic wood density.
Table 4.23 Restriction enzymes identified for specific cutting on C4H and CAD SNP regions.

LIST OF FIGURES

Figure 2.1 (a) Kelampayan tree; (b) Kelampayan seedlings; (c) Kelampayan flowers (Source: http://www.flickr.com/photos/ravi_gogte/3821932711/); and (d) Kelampayan fruits (Krisnawati et al., 2011).
Figure 2.2 Structure of wood.
Figure 2.3 Structure of plant cell wall.
Figure 2.4 Three types of monolignols.
Figure 2.5 Phenylpropanoid pathway leading to monolignol precursors of lignin.
Figure 2.6 Schematic diagram of the pivotal role of C4H as a functional link between the cytosolic enzymes of general phenylpropanoid metabolism, PAL and 4CL, and the membrane-associated electron-transfer reactions catalyzed by CPR.
Figure 2.7 Schematic representation of LD among genetic markers, genes and a causal mutation.
Figure 2.8 Steps involved in conducting an association genetics study.
Figure 2.9 Genotyping strategies for candidate gene and genome-wide association study (GWAS).
Figure 3.1 Grouping of the C4H EST singletons according to the alignment score and position on gene.
Figure 3.2 Grouping of the CAD EST singletons according to the alignment score and position on gene.
Figure 3.3 Collection of Kelampayan developing xylem tissue.
Figure 3.4 Collection of Kelampayan inner bark tissues.
Figure 3.5 The map for Kelampayan samples collected.
Figure 3.6 The locations of overlapping primers designed for C4H and CAD DNA sequences.
Figure 3.7 Collection of Kelampayan wood cores.
Figure 4.1 The hypothetical full-length C4H predicted through contig mapping approach.
Figure 4.2 The hypothetical full-length CAD predicted through contig mapping approach.
Figure 4.3 The hypothetical full-length C4H cDNA sequences.
Figure 4.4 The hypothetical full-length CAD cDNA sequences.
Figure 4.5 Total RNA isolated from developing xylem tissue of Kelampayan.
Figure 4.6 Total DNA isolated from 12 inner bark tissues of Kelampayan.
Figure 4.7 Gel-purified PCR products for full-length CAD and C4H cDNA.
Figure 4.8 Plasmids isolated for full-length CAD and C4H genes.
Figure 4.9 Restriction digestions on CAD and C4H plasmids by using EcoRI restriction enzyme.
Figure 4.10 The full-length NcC4H cDNA sequence discovered from Kelampayan. The start (ATG) and stop (TAG) codons are shown in bold and underlined.
Figure 4.11 The NcC4H amino acid sequence translated for Kelampayan.
Figure 4.12 The C4H genomic DNA sequence discovered from Kelampayan.
Figure 4.13 The graphical presentation of full-length NcC4H cDNA and partial C4H genomic DNA sequence.
Figure 4.14 Multiple alignment of NcC4H protein sequence with C4H protein sequences from other species.
Figure 4.15 Classification of C4H genes from different plant species.
Figure 4.16 Phylogenetic tree constructed for NcC4H gene from Kelampayan by using MEGA5 software.
Figure 4.17 The secondary structure of NcC4H protein predicted by using CDM software.
Figure 4.18 Tertiary structure of NcC4H protein predicted by using Phyre2.
Figure 4.19 The NcCAD cDNA sequence discovered from Kelampayan. The start (ATG) and stop (TAG) codons are shown in bold and underlined.
Figure 4.20 The NcCAD amino acid sequence translated for Kelampayan.
Figure 4.21 The CAD genomic DNA sequences discovered from Kelampayan. The start (ATG) and stop (TAG) codons are shown in bold and underlined.
Figure 4.22 The graphical presentation of full-length NcCAD cDNA and partial CAD genomic DNA sequence.
Figure 4.23 The motif domains detected within NcCAD amino acid sequence.
Figure 4.24 Phylogenetic tree constructed for NcCAD gene from Kelampayan by using MEGA5 software.
Figure 4.25 Phylogenetic tree showing the classification of CADs in CAD gene family.
Figure 4.26 The secondary structure of NcCAD protein predicted by using CDM software.
Figure 4.27 Tertiary structure of NcCAD protein modelled by using Phyre2 (colour by secondary structure).
Figure 4.28 Colony PCR product for CAD3 region analyzed on 1.5 % agarose gel.
Figure 4.29 Plasmid isolated from C4H1 region. Lane M1: Supercoiled DNA ladder (Promega, USA); Lane M2: λ HindIII DNA marker (Promega, USA).
Figure 4.30 [...] ladder (Invitrogen, USA); Lane M2: λ HindIII DNA marker (Promega, USA).
Figure 4.31 [...] (Promega, USA).
Figure 4.32 [...] (Promega, USA).
Figure 4.33 Plasmid isolated from CAD1 region. Lane M1: Supercoiled DNA [...] (Promega, USA).
Figure 4.34 [...] (Promega, USA).
Figure 4.35 [...] (Promega, USA).
Figure 4.36 Different nucleotide substitutions observed in C4H and CAD DNA sequences.
Figure 4.37 Two-nucleotide (A/C) and three-nucleotide (A/C/T) SNPs detected in C4H3 region.
Figure 4.38 The deletion of the G nucleotide at position 894 of the C4H3_NcMT2 fragment changed the amino acid sequence starting from the mutation site.
Figure 4.39 Neighbour joining tree constructed using shared allele distance calculated from the combination of CAD and C4H SNP data.
Figure 4.40 The TASSEL-generated triangle plot for pairwise LD between SNP marker sites in C4H and CAD gene fragments (above the diagonal displays r² values and below the diagonal displays the corresponding p-values).
Figure 4.41 LD plot for all paired polymorphic sites in C4H and CAD genes, fitted with a linear and a logarithmic trend line.

LIST OF ABBREVIATIONS

AFLP Amplified fragment length polymorphism

C4H Cinnamate 4-hydroxylase

CAD Cinnamyl alcohol dehydrogenase

CAPS Cleaved amplified polymorphic sequence

CCR Cinnamoyl-CoA reductase

DBH Diameter at breast height

EST Expressed sequence tag

GAS Gene-assisted selection

GLM General linear model

GWAS Genome-wide association study

Indel Insertion and deletion mutation

LD Linkage disequilibrium

LKCT Lesser-known commercial timbers

MAS Marker-assisted selection

PCR Polymerase chain reaction

QTL Quantitative trait loci

QTN Quantitative trait nucleotide

RAPD Random amplified polymorphic DNA

RFLP Restriction fragment length polymorphism

SAD Sinapyl alcohol dehydrogenase

SNP Single nucleotide polymorphism

SSR Simple sequence repeat

UTR Untranslated region

CHAPTER I

INTRODUCTION

Wood is undoubtedly the most versatile raw material available to humans for construction, paper, fuel and non-wood forest products. The wood supply available from natural forests is insufficient to meet the ever-increasing demand for wood as the wood-using population continues to grow (FAO, 2010). Rapid deforestation and declining timber production have made the development of planted forests urgent, with emphasis on fast-growing indigenous tree species. The economic advantages of planted forests with genetically superior trees have therefore gained great attention worldwide (Sedjo, 1999). Hence, the future of planted forests, as well as of the forest products industry, will rely upon the ability to domesticate indigenous tree species and adapt or alter them for maximum economic yield in highly controlled environments (Plomion et al., 2001).

Neolamarckia cadamba, locally known as Kelampayan, is one of the indigenous tree species being selected for forest plantation establishment in Sarawak owing to its high productivity and short rotation time (Sarawak Timber Industry Development Corporation, 2009). Kelampayan is one of the lesser-known commercial timbers (LKCT) and possesses various benefits for wood-based industries, with uses such as picture frames, moulding, skirting, wooden sandals, disposable chopsticks, general utility furniture, veneer and plywood, as well as pulp and paper making (Lim et al., 2005). Moreover, the leaves and bark of Kelampayan have been reported to have high pharmacological value (Joker, 2000; Patel and Kumar, 2008). Root extracts from Kelampayan can also reduce blood glucose concentration, suggesting the utility of Kelampayan extracts in the treatment of diabetes (Acharyya et al., 2010).

Planting stock of Kelampayan is very important in ensuring an adequate supply of timber to fulfil the high market demand. Various efforts have been made to optimize both the scale of production and the quality of the wood. To date, studies on Kelampayan at the molecular level are still limited. However, a Kelampayan wood formation genomic resource database, known as Cadamomics (10,368 ESTs), has been developed by researchers from the Forest Genomics and Informatics Laboratory, UNIMAS (http://fgilab.com/) and Sarawak Forestry Corporation. This database has yielded an array of useful information and resources for researchers to explore the genomic basis of Kelampayan in depth.

Understanding the molecular and physiological mechanisms of wood formation is now considered a main research area that must draw serious attention from diverse disciplines, spanning conventional biochemistry and wood science to genomics. To date, the study of adaptive traits, especially in wood formation, has gained a lot of attention in numerous forest tree species, such as poplar (Sterky et al., 1998; Jansson and Douglas, 2007), Eucalyptus (Rengel et al., 2009), Acacia hybrid (Yong et al., 2011) and loblolly pine (Whetten et al., 2001). Therefore, understanding the structure and composition of Kelampayan wood through genome integration is vital to better utilize this wood material.

Wood is a complex material composed of polymers of cellulose, hemicelluloses and lignin that are physically and chemically bound together. Lignin, the second most abundant organic compound after cellulose, represents approximately 20-30% of plant biomass. It is mainly found in the supporting and conducting tissues of plants, such as fibers and tracheary elements. Lignin is formed through dehydrogenative polymerization of the monolignols p-coumaryl alcohol, coniferyl alcohol and sinapyl alcohol, which give rise to p-hydroxyphenyl (H) units, guaiacyl (G) units and syringyl (S) units, respectively (Brett and Waldron, 1990). Owing to its mechanical rigidity and its deposition in the cell wall, lignin offers mechanical and structural support to plants and allows water to be transported more smoothly through the tracheids and vessels. Moreover, lignin is very resistant to degradation in nature and thus provides a significant protective function against pathogens and decay fungi (Brett and Waldron, 1990; Higuchi, 1997).

In the pulp and paper industry, lignin has to be separated from cellulose and hemicelluloses by an expensive and polluting process (Sederoff, 1999). In this regard, the study of lignin biosynthesis genes, such as CAD, which encodes cinnamyl alcohol dehydrogenase, and C4H, which encodes cinnamate 4-hydroxylase, has opened new possibilities for the pulp and paper industry, since up- or down-regulation of these genes results in altered lignin production (Baucher et al., 2003). The main function of C4H is to catalyze the hydroxylation of cinnamate to 4-coumarate at an early stage of the lignin biosynthesis pathway, while CAD catalyzes the reduction of cinnamaldehydes to p-coumaryl, coniferyl and sinapyl alcohols during the final stage of the pathway (Lewis, 1999).

To date, a considerable number of full-length C4H and CAD gene sequences have been published in NCBI, but no such information is available for Kelampayan. Moreover, C4H and CAD genes are known to be correlated with other phenotypic variations in forest tree species, not with lignin production alone (Yu et al., 2006; Gonzalez-Martinez et al., 2007; Tchin et al., 2011; Schilmiller et al., 2009; Bjurhager et al., 2010; Wegrzyn et al., 2010). Abreu et al. (2009) also proposed that the high proportion of β-O-4 (alkyl aryl ether) bonds in the lignin of angiosperms may affect wood properties. Therefore, extensive study, especially at the molecular level, is needed to determine the correlation between the C4H and CAD genes and Kelampayan wood properties.

Phenotypic variation among individuals may be due to the cumulative effect of a number of genes and/or environmental influences (Bentz et al., 2011; Flatscher et al., 2012). However, the basic genetic architecture of such complex adaptive traits is very difficult to uncover through traditional linkage-based approaches. Moreover, the traditional linkage-based approach to choosing planting materials with desired traits is laborious, time consuming and expensive because a mapping population must be established (Myles et al., 2009). With advances in science and technology, a promising approach to studying complex adaptive traits is to investigate the sequence variations (single nucleotide polymorphisms, SNPs) of candidate genes at the nucleotide level and the correlation of such variations with the phenotypes of the trees (Neale and Kremer, 2011). This approach, namely association genetics, has been broadly embraced in forest tree studies and is more powerful in identifying the genes or loci that contribute to variation in complex traits (Long and Langley, 1999). An association genetics study is a natural, uncontrolled experiment that can give a higher mapping resolution than linkage mapping (Myles et al., 2009). In the space of just a few years, association genetics has been widely applied in forest tree genera such as Pinus, Pseudotsuga, Populus and Eucalyptus (Neale and Kremer, 2011).
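
The core idea of a candidate-gene association test, correlating the genotype at one SNP with a quantitative trait, can be illustrated with a one-way analysis of variance. The sketch below is a minimal illustration under stated assumptions and is not the analysis performed in this thesis (which is described in Section 3.9.2). The genotype labels and wood density values are hypothetical toy data, the function name `one_way_f` is introduced here for illustration, and population structure and multiple testing are ignored.

```python
from collections import defaultdict

# Hypothetical toy data: genotype at one SNP and basic wood density
# for 12 trees. These values are invented for illustration only.
GENOTYPES = ["AA"] * 4 + ["AG"] * 4 + ["GG"] * 4
DENSITY = [0.30, 0.32, 0.31, 0.33,
           0.35, 0.36, 0.34, 0.37,
           0.40, 0.41, 0.39, 0.42]


def one_way_f(genotypes, trait):
    """Return (F, df_between, df_within) for a trait grouped by genotype."""
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, trait):
        groups[genotype].append(value)

    n, k = len(trait), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and n > classes")

    grand_mean = sum(trait) / n
    ss_between = 0.0  # variation explained by genotype class
    ss_within = 0.0   # residual variation inside each class
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_between += len(values) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in values)

    if ss_within == 0:
        return float("inf"), k - 1, n - k
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k


f_stat, df_between, df_within = one_way_f(GENOTYPES, DENSITY)

if __name__ == "__main__":
    print(f"F = {f_stat:.1f} on {df_between} and {df_within} df")
```

For the toy data, the statistic is F = 48.8 on 2 and 9 degrees of freedom. A p-value could then be obtained from the F distribution, for example with `scipy.stats.f.sf(f_stat, df_between, df_within)` if SciPy is available.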

A single nucleotide polymorphism (SNP), where sequences differ at only a single nucleotide, has become the marker of choice for association genetics studies owing to its abundance, stability, ubiquity and interspersed distribution in the nuclear genome (Fusari et al., 2008). SNPs are less polymorphic than other genetic markers, for example simple sequence repeats (SSRs), but still provide information regarding the genetic constitution of a living organism (Rafalski, 2002a). SNPs are also far easier to detect, since only a single base in a specific sequence is targeted (Prince et al., 2001). Hayashi et al. (2004) also argued that SNP markers are preferable to RFLP markers because of their efficiency and cost-effectiveness. In addition, SNP markers are suitable for germplasm selection at the early seedling stage because the amount of genomic DNA required for detection is relatively low (Hayashi et al., 2004).
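
To make the definition concrete, the following minimal sketch scans a set of pre-aligned sequences and reports the columns at which more than one nucleotide is observed. It is an illustration only: the sequences and sample names are invented, the function name `find_snps` is hypothetical, and gaps and ambiguous bases are simply ignored rather than treated as indels. It does not reproduce the sequence variation analysis used in this study.

```python
# Hypothetical aligned fragments from four trees (toy data, equal length).
TOY_ALIGNMENT = {
    "tree1": "ATGGCCATTG",
    "tree2": "ATGGCTATTG",
    "tree3": "ATGACTATTG",
    "tree4": "ATGGCCAT-G",
}


def find_snps(aligned):
    """Return {1-based position: sorted alleles} for polymorphic columns.

    `aligned` maps sample names to equal-length aligned sequences.
    Gap ('-') and ambiguous ('N') characters are ignored.
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    snps = {}
    for i in range(length):
        alleles = {seq[i].upper() for seq in seqs} - {"-", "N"}
        if len(alleles) > 1:
            snps[i + 1] = sorted(alleles)
    return snps


if __name__ == "__main__":
    for position, alleles in find_snps(TOY_ALIGNMENT).items():
        print(position, "/".join(alleles))
```

In the toy alignment, positions 4 (A/G) and 6 (C/T) are reported, while position 9 is not, because the only difference there is a gap.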

Application of genomic science in Kelampayan breeding programs will improve our understanding of the species' unique biology and accelerate the discovery of genes controlling economically and ecologically important traits through candidate gene-based association genetics studies. Gene-assisted selective breeding in the forest industry using SNP markers is expected to increase selection efficiency and reduce the time and cost associated with measuring the traits (Fusari et al., 2008). However, the prerequisite for determining nucleotide polymorphism in candidate genes is knowledge of the gene sequences. Many partial or full-length gene sequences are now accessible through online
