
Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)


DESCRIPTION

Scientific seminar at the Carlsberg Research Institute (CRI) in Copenhagen, Denmark on trait data mining using the Focused Identification of Germplasm Strategy (FIGS), 4th November 2009. Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Sci. 50(6):2418-2430. doi: 10.2135/cropsci2010.03.0174


Page 2: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Domestication bottleneck
• Utilization of genetic diversity
• Core collection subset selection
• Trait mining selection
• Computer modeling

• Example 1:
  • Nordic barley landraces (2005)
  • N-PLS regression (in MATLAB)

• Example 2:
  • Net blotch in barley (ICARDA, USDA)
  • Discriminant analysis (DA)

Page 3: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

 

[Figure labels: teosinte; corn, maize; wild tomato; tomato]

Page 4: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure: funnel diagram with three stages: crop wild relatives, traditional landraces, modern cultivars. Circles labelled A, B and C mark alleles.]

Genetic bottlenecks during crop domestication and modern plant breeding. The circles represent allelic variation. The funnels represent allelic variation of genes found in the crop wild relatives, but gradually lost during domestication, traditional cultivation and modern plant breeding.

Page 6: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.

• How does the scientist select a small subset likely to have the useful trait?

• Example: More than 560 000 wheat accessions in genebanks worldwide.

Slide adapted from a slide by Ken Street, ICARDA (FIGS team)

Page 7: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• The scientist or the breeder needs a smaller subset to cope with the field screening experiments.

• A common approach is to create a so-called core collection.

Sir Otto H. Frankel (1900-1998) proposed a limited set or "core collection" established from an existing collection with minimum similarity between its entries.

The core collection is of limited size and chosen to represent the genetic diversity of a large collection, a crop, a wild species or group of species (1984).
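Frankel's definition does not prescribe an algorithm. As a rough illustration of "minimum similarity between its entries", the sketch below uses a greedy max-min distance rule, which is only one of several possible selection strategies. The function name and the two-dimensional descriptor values are hypothetical.

```python
def greedy_core(points, k):
    """Pick k entries so that each new entry is as far as possible
    from the entries already chosen (greedy max-min distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [0]  # start from the first entry (an arbitrary choice)
    while len(chosen) < k:
        best = max(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: min(dist2(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Hypothetical 2-D descriptor values for six accessions.
accessions = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 5), (10, 0)]
core = greedy_core(accessions, 3)
print(core)
```

Near-duplicates such as the second and fourth entries are left out, which is the intended effect of a diversity-oriented subset.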

Page 8: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Given that the trait property you are looking for is relatively rare:

• Perhaps as rare as a unique allele for one single landrace cultivar...

• Getting what you want is largely a question of LUCK!

Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
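To put a number on the role of luck, here is a minimal sketch of the chance that a random subset contains at least one accession carrying a rare trait. The collection size is the wheat figure quoted earlier in the talk; the number of carriers and the subset size are assumed values for illustration, not figures from the talk.

```python
from math import comb

def prob_at_least_one(total, carriers, sample_size):
    """Hypergeometric chance that a random sample (drawn without
    replacement) holds at least one of the `carriers` accessions."""
    none = comb(total - carriers, sample_size) / comb(total, sample_size)
    return 1.0 - none

# 560 000 accessions (from the talk); 50 carriers and a subset of
# 500 accessions are assumptions made only for this example.
p = prob_at_least_one(560_000, 50, 500)
print(f"P(at least one carrier) = {p:.3f}")
```

With these assumed numbers the chance stays below one in twenty, which is the motivation for selecting the subset with a targeted strategy rather than at random.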


Page 10: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Wild relatives are shaped by the environment

Primitive cultivated crops are shaped by local climate and humans

Traditional cultivated crops (landraces) are shaped by climate and humans

Modern cultivated crops are mostly shaped by humans (plant breeders)

Perhaps future crops are shaped in the molecular laboratory…?

Page 11: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Objective of this study:

– Explore climate data as a prediction model for "pre-screening" of crop traits BEFORE full scale field trials.

– Identification of landraces with a higher probability of holding an interesting trait property.

Page 12: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Primitive crops and traditional landraces are an important source of novel traits for improvement of modern crops.

• Landraces are often not well described for the economically valuable traits.

• Identification of novel crop traits will often be the result of a larger field trial screening project (thousands of individual plants).

• Large-scale field trials are very costly in both land area and human working hours.

Page 13: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

The underlying assumption of FIGS selection is that the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait.

The aim is to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).
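The idea of explaining a trait score from climate data can be sketched with the simplest possible model: ordinary least squares with a single climate variable. This is a stand-in chosen for brevity, not the N-PLS method named in the outline, and all numbers are synthetic.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Synthetic values for illustration only (not data from the study):
# mean July temperature at the collecting site vs. days to heading.
july_temp = [12.0, 14.0, 16.0, 18.0]
heading_days = [70.0, 66.0, 62.0, 58.0]

a, b = fit_line(july_temp, heading_days)
predicted = a + b * 15.0  # predicted trait score for a site at 15 degrees C
print(a, b, predicted)
```

The climate variable plays the role of the independent variable and the trait score the dependent variable, as on the slide.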

Page 14: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

1) Landrace samples (genebank seed accessions)
2) Trait observations (experimental design)
3) Climate data (for the landrace location of origin)

• The accession identifier (accession number) provides the bridge to the crop trait observations.
• The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
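The two "bridges" described above amount to a pair of key lookups. A minimal sketch follows, with made-up records and hypothetical field names (`acc_num`, `lat`, `lon`, `heading_days`, `tmin_jan`); a real climate lookup would go through a gridded layer rather than an exact coordinate match.

```python
# Hypothetical records; field names and values are illustrative only.
accessions = [
    {"acc_num": "ACC1", "lat": 61.0, "lon": 27.3},
    {"acc_num": "ACC2", "lat": 56.6, "lon": 9.4},
]
traits = {"ACC1": {"heading_days": 62}, "ACC2": {"heading_days": 70}}
climate = {(61.0, 27.3): {"tmin_jan": -12.0}, (56.6, 9.4): {"tmin_jan": -2.5}}

def link(accessions, traits, climate):
    """Join the three sources: accession number -> trait observations,
    (latitude, longitude) -> climate data."""
    rows = []
    for acc in accessions:
        row = dict(acc)
        row.update(traits.get(acc["acc_num"], {}))
        row.update(climate.get((acc["lat"], acc["lon"]), {}))
        rows.append(row)
    return rows

table = link(accessions, traits, climate)
print(table[0])
```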

Page 15: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure labels: Lima, Peru; Benin; Alnarp, Sweden; Svalbard]

Page 16: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

http://barley.ipk-gatersleben.de

[Photo captions:]
• Powdery mildew, Blumeria graminis
• Leaf spots, Ascochyta sp.
• Yellow rust, Puccinia striiformis
• Black stem rust, Puccinia graminis
• Faba bean, Finland
• Field trials, Gatersleben, Germany
• Forage crops, Dotnuva, Lithuania
• Radish (S. Jeppson)
• Potato, Priekuli, Latvia
• Linnés äpple

Page 17: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

The climate data is extracted from the WorldClim dataset. http://www.worldclim.org/

Data from weather stations worldwide are combined into a continuous surface layer.

Climate data for each landrace is extracted from this surface layer.

Precipitation: 20 590 stations
Temperature: 7 280 stations
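Extracting a value from a continuous surface layer comes down to finding the grid cell that contains the collecting site. The sketch below assumes a regular longitude/latitude grid stored as a list of rows; the tiny grid, its cell size and its values are made up, and `cell_value` is a hypothetical helper, not a WorldClim API. The coordinates are those given for Landskrona in the station table.

```python
def cell_value(grid, lon, lat, west, north, cell_size):
    """Return the grid value of the cell containing (lon, lat).
    Row 0 is the northern edge; column 0 is the western edge."""
    col = int((lon - west) // cell_size)
    row = int((north - lat) // cell_size)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("coordinate outside the grid")
    return grid[row][col]

# Made-up 2 x 3 layer with 10-degree cells, covering 0-30 E and 50-70 N.
layer = [
    [-10.5, -12.0, -14.2],  # 60-70 N
    [-1.0, -3.5, -6.8],     # 50-60 N
]
value = cell_value(layer, lon=12.8333, lat=55.8667,
                   west=0.0, north=70.0, cell_size=10.0)
print(value)
```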

Page 18: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

FIGS selection is a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.

Page 19: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Origin of Concept (1980s): Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for boron toxicity.

What is

Slide made by Michael Mackay 1995

http://www.figstraitmine.org/

South  Australia  

Mediterranean  region  

Page 20: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

FIGS

The FIGS technology takes much of the guesswork out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments.

http://www.figstraitmine.org/

Page 21: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Slide made by Michael Mackay 1995


Page 22: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• No sources of Sunn pest resistance previously found in hexaploid wheat.

• 2 000 accessions screened at ICARDA without result (during the last 7 years).

• A FIGS set of 534 accessions was developed and screened (2007, 2008).

• 10 resistant accessions were found!

• The FIGS selection started from 16 000 landraces from VIR, ICARDA and AWCC.
• Exclude origin CHN, PAK, IND, where Sunn pest was only recently reported (6 328 acc).
• Only one accession per collecting site (2 830 acc).
• Excluding dry environments below 280 mm/year.
• Excluding sites of low winter temperature below 10 degrees Celsius (1 502 acc).

Slide adapted from Ken Street, ICARDA (FIGS team)
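The exclusion steps listed above form a simple filter chain. The following is a minimal sketch with hypothetical field names and made-up records; the thresholds and country codes are taken from the slide, and the order of the steps follows the slide.

```python
def figs_filter(accessions):
    """Apply the four exclusion steps from the slide, in order."""
    seen_sites = set()
    kept = []
    for acc in accessions:
        if acc["country"] in {"CHN", "PAK", "IND"}:
            continue  # Sunn pest only recently reported there
        if acc["site"] in seen_sites:
            continue  # keep only one accession per collecting site
        seen_sites.add(acc["site"])
        if acc["rain_mm_year"] < 280:
            continue  # dry environment
        if acc["winter_temp_c"] < 10:
            continue  # winter temperature below the slide's threshold
        kept.append(acc)
    return kept

# Made-up records for illustration only.
records = [
    {"id": "A", "country": "SYR", "site": "s1", "rain_mm_year": 350, "winter_temp_c": 11},
    {"id": "B", "country": "SYR", "site": "s1", "rain_mm_year": 350, "winter_temp_c": 11},
    {"id": "C", "country": "CHN", "site": "s2", "rain_mm_year": 400, "winter_temp_c": 12},
    {"id": "D", "country": "TUR", "site": "s3", "rain_mm_year": 200, "winter_temp_c": 12},
    {"id": "E", "country": "IRN", "site": "s4", "rain_mm_year": 300, "winter_temp_c": 5},
]
selected = [acc["id"] for acc in figs_filter(records)]
print(selected)
```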


Page 24: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

– The initial model is developed from the training set

– Fine tuning of model parameters and settings

– No model can ever be absolutely correct
– A simulation model can only be an approximation
– A model is always created for a specific purpose

– The simulation model is applied to make predictions based on new fresh data

– Take care to avoid extrapolation problems

Page 25: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

– For the initial calibration or training step.

– Further calibration, tuning step
– Often cross-validation on the training set is used to reduce the consumption of raw data.

– For the model validation or goodness of fit testing.

– New external data, not used in the model calibration.
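Cross-validation on the training set, as mentioned above, can be sketched as leave-one-out prediction: each sample is predicted by a model calibrated on all the others. The model here is a one-predictor least-squares line chosen for brevity (an assumption, not the model used in the study), and the data are synthetic.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def loo_predictions(x, y):
    """Predict each sample from a model fitted without that sample."""
    preds = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(a + b * x[i])
    return preds

# Synthetic climate variable and trait scores, for illustration only.
x = [10.0, 12.0, 14.0, 16.0, 18.0]
y = [74.5, 69.5, 66.5, 61.5, 58.5]

preds = loo_predictions(x, y)
rmsecv = sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y))
print(f"RMSECV = {rmsecv:.2f}")
```

A separate test set of new external data would still be needed for the final validation step listed on the slide.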


Page 29: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Station                      Altitude   Latitude   Longitude
Priekuli, Latvia             83 m       57.3167    25.3667
Bjørke forsøksgård, Norway   149 m      60.7667    11.2167
Landskrona, Sweden           3 m        55.8667    12.8333

Two years: 2002 and 2003

Page 30: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

accide  AccNum    Country        Locality               Elevation  Latitude  Longitude  Coordinate
7436    NGB27     Finland        Sarkalahti, Luumäki    95 m       61.0333   27.3333    SESTO
9717    NGB456    Norway         Dønna, Nordland        71 m       66.1167   12.5       Georeferenced
9601    NGB468    Norway         Trysil                 400 m      61.2833   12.2833    Georeferenced
9600    NGB469    Norway         BJØRNEBY               400 m      61.2833   12.2833    Georeferenced
7966    NGB775    Sweden         Överkalix, Allsån      45 m       66.4      22.9333    SESTO
8510    NGB776    Sweden         Överkalix              100 m      66.4      22.7667    SESTO
7810    NGB792    Finland        Luusua, Kemijärvi      145 m      66.4833   27.35      SESTO
9538    NGB2072   Norway         Finset                 1220 m     60.6      7.5        Georeferenced
8482    NGB2565   Sweden         Öland                  11 m       56.7333   16.6667    Georeferenced
9102    NGB4641   Denmark        Støvring, Jylland      55 m       56.8833   9.8333     Georeferenced
9015    NGB4701   Faroe Islands  Faroe Islands          81 m       62.0167   -6.7667    Georeferenced
9039    NGB6300   Faroe Islands  Faroe Islands          81 m       62.0167   -6.7667    Georeferenced
8531    NGB9529   Denmark        Lyderupgaard           9 m        56.5667   9.35       Georeferenced
7344    NGB13458  Finland        Koskenkylä, Rovaniemi  91 m       66.5167   25.8667    Georeferenced

Page 31: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

From a total of 19 landrace accessions included in the dataset, only 4 of the landrace accessions included georeferenced coordinates in the NordGen SESTO database.

10 accessions were georeferenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources.

For 5 accessions there was not enough information available to locate the original gathering location.

Right side illustration: Example of georeferencing for NGB9529, a landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com


Page 33: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure: climate data cube (3-way array): 14 samples (landraces, location of origin) × 12 monthly means × 3 climate variables]

Climate data (mode 3):
• Minimum temperature (Jan, Feb, Mar, …)
• Maximum temperature (Jan, Feb, Mar, …)
• Precipitation (Jan, Feb, Mar, …)
• … (many more layers can be added)
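The 14 × 12 × 3 layout can be sketched as a nested list, together with the unfolding into an ordinary two-way table that many tools expect. The dimensions come from the slide; the zero values are placeholders and the `unfold` helper is hypothetical.

```python
N_SAMPLES, N_MONTHS, N_VARS = 14, 12, 3  # dimensions from the slide

# Placeholder values (zeros); real values would come from the climate layers.
cube = [[[0.0 for _ in range(N_VARS)] for _ in range(N_MONTHS)]
        for _ in range(N_SAMPLES)]

def unfold(cube):
    """Unfold samples x months x variables into a 2-way table
    (samples x (variables * months)), one block of months per variable."""
    n_months = len(cube[0])
    n_vars = len(cube[0][0])
    return [
        [sample[m][v] for v in range(n_vars) for m in range(n_months)]
        for sample in cube
    ]

table = unfold(cube)
print(len(table), len(table[0]))
```

Adding another climate layer only extends mode 3, which is why the slide notes that many more layers can be added.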

Page 34: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure: trait data cube (3-way array): 28 records (14 landraces, x2) × 6 traits × 6 station-years]

Mode 2 (Traits)
* Heading days
* Ripening days
* Length of plant
* Harvest index
* Volumetric weight
* Grain weight (tgw)

Mode 3
* LVA 2002: Priekuli (Lv) 2002
* LVA 2003: Priekuli (Lv) 2003
* NOR 2002: Bjørke (N) 2002
* NOR 2003: Bjørke (N) 2003
* SWE 2002: Landskrona (S) 2002
* SWE 2003: Landskrona (S) 2003


Page 37: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Mode 3 (climate variables) has very different ranges of numerical values (tmin, tmax, and prec). Scaling across mode 3 is thus applied to the multi-way models.

On the left, the box plot is displayed for the 3-way data unfolded so as to keep the dimensions of mode 3.

The 3-way climate data was reasonably well described by a PARAFAC model of two components.

[Figure: box plots of tmin, tmax and prec; panels "Box plot, raw data" and "Scaling across mode 3"]
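One possible reading of the scaling step is that every climate variable is divided by its own standard deviation, computed over all samples and months, so that temperature and precipitation end up on comparable scales. The sketch below follows that reading; it is an assumption about the preprocessing, and the small cube holds made-up values.

```python
from statistics import pstdev

def scale_within_variable(cube):
    """Divide every value by the standard deviation of its climate
    variable (mode 3), computed over all samples and months."""
    n_vars = len(cube[0][0])
    scales = []
    for v in range(n_vars):
        values = [month[v] for sample in cube for month in sample]
        scales.append(pstdev(values))
    return [[[month[v] / scales[v] for v in range(n_vars)]
             for month in sample] for sample in cube]

# Made-up cube: 2 samples x 2 months x 2 variables (temperature, precipitation).
cube = [
    [[-5.0, 40.0], [5.0, 80.0]],
    [[0.0, 60.0], [10.0, 100.0]],
]
scaled = scale_within_variable(cube)
print(scaled[0][0])
```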

Page 38: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

PARAFAC split-half (mode 1) analysis:

The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a solution very similar to that of the model calibrated from the complete dataset.

The PARAFAC model is thus a general and stable model for the scope of Scandinavia.


Page 40: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Often the critical levels (α) for p-value significance are set at 0.05, 0.01 and 0.001.

• Modeling 14 samples (landraces) gives:
– 12 degrees of freedom for the correlation tests (mean x, y)
– One-tailed test (looking only at positive correlation of predictions versus the reference values).
– A coefficient of determination (r²) larger than 0.56 is significant at the 0.001 (0.1%) level for 14 values/samples.

Many introductory textbooks on statistics include a table of Critical Values for Pearson's r.
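The 0.56 threshold can be reproduced from a t table instead of a table of critical r values, using the standard relation r = t / sqrt(t² + df). The sketch below hard-codes the one-tailed critical t for α = 0.001 and 12 degrees of freedom as read from a standard t table; the function name is hypothetical.

```python
from math import sqrt

def critical_r(t_crit, df):
    """Convert a critical t value into the critical Pearson r."""
    return t_crit / sqrt(t_crit ** 2 + df)

n = 14
df = n - 2       # 12 degrees of freedom
t_crit = 3.930   # one-tailed t, alpha = 0.001, df = 12 (standard t table)

r_crit = critical_r(t_crit, df)
r2_crit = r_crit ** 2
print(f"critical r = {r_crit:.3f}, critical r^2 = {r2_crit:.3f}")
```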

Page 41: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure: results by trait (Heading, Ripening, Length, H-Index, Vol wgt, TGW) and by station (Priekuli (L), Bjorke (N), Landskrona (S))]

Page 42: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

[Figure panels: LVA (2002), LVA (2003), NOR (2002), NOR (2003), SWE (2002), SWE (2003)]

Page 43: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Latvia 2002 (LY11)
– May 2002 was extremely dry in Priekuli.
– June 2002 was extremely wet in Priekuli.
– The wet June caused germination on the spikes for many of the early varieties.

• Landskrona 2003 (LY32)
– June 2003 was extremely dry in Landskrona.
– June was the time for grain filling here.

• Too extreme for the genotype to be "normally" expressed?

• Too large an effect from "G by E" interaction?

Page 44: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Rainfall (mm)

Station                      Year   Sowing week   May    June    July    August
Bjørke forsøksgård, Norway   2002   17            82.9   67.4    128.5   136.5
                             2003   21            75.1   85.7    67.1    53.2
Landskrona, Sweden           2002   13            53.5   75.3    76.4    68.9
                             2003   15            70.7   40.4    76.0    45.7
Priekuli, Latvia             2002   17            38.2   111.1   67.0    11.3
                             2003   19            88.0   59.2    87.8    175.8


Page 48: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• The first dataset I started to work with is a "FIGS" dataset with genebank accessions of barley (Hordeum vulgare ssp. vulgare) collected from different countries worldwide and tested for susceptibility to net blotch infection. Net blotch is a common disease of barley caused by the fungus Pyrenophora teres.

• The barley plants were inoculated with the fungus and the percentage of the leaves infected with the disease was normalized to an interval scale (1 to 9).

• Scores 1-3 are basically resistant (group 1)
• Scores 4-6 are intermediate (group 2)
• Scores 7-9 are susceptible (group 3)
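The grouping of the 1 to 9 scores into three classes is a simple mapping. A minimal sketch, with a hypothetical function name:

```python
def net_blotch_group(score):
    """Map a net blotch score (1 to 9) to group 1 (resistant),
    2 (intermediate) or 3 (susceptible), as defined on the slide."""
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    if score <= 3:
        return 1
    if score <= 6:
        return 2
    return 3

groups = [net_blotch_group(s) for s in range(1, 10)]
print(groups)
```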

Page 49: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• Agro-climatic Zone (UNESCO classification)
• Soil classification (FAO Soil map)
• Aridity (dryness)
• Precipitation
• Potential evapotranspiration (water loss)
• Temperature
• Maximum temperatures
• Minimum temperatures

(mean values for month and year)

Page 50: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

• The proportion of correctly classified samples for the training dataset was 45.9%, and we would expect a similar success rate for the prediction of the "blinded" values.

• Remember that random classification into three groups would give 33.3%.

• A test set of 9 samples showed a proportion of correct classifications of 44.4%.

Discriminant Analysis: obs_nb versus acz_moisture; ...

Quadratic Method for Response: obs_nb

Predictors: acz_moisture; acz_winter_temp; acz_summer_temp; arid_annual; pet_annual; prec_annual; temp_annual; tmax_annual; tmin_annual

Group          1      2      3
Count       1049   1190    234

Summary of classification

Put into Group      1      2      3
1                 523    427     48
2                 287    451     25
3                 238    314    163
Total N          1048   1192    236
N correct         523    451    163
Proportion      0,499  0,378  0,691

N = 2476   N Correct = 1137   Proportion Correct = 0,459
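The summary figures can be recomputed directly from the classification table. The sketch below reads the rows as the group assigned by the model and the columns as the true group, which is consistent with the "Total N" line; the variable names are illustrative.

```python
# Rows: group assigned by the model; columns: true group (values from the slide).
confusion = [
    [523, 427, 48],
    [287, 451, 25],
    [238, 314, 163],
]

totals = [sum(row[j] for row in confusion) for j in range(3)]
correct = [confusion[j][j] for j in range(3)]
per_group = [c / t for c, t in zip(correct, totals)]
overall = sum(correct) / sum(totals)

print(totals, [round(p, 3) for p in per_group], round(overall, 3))
```

The overall proportion is compared against the 33.3% expected from random assignment to three groups.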

Page 51: Trait data mining seminar at the Carlsberg research institute (CRI) (4 Nov 2009)

Eddy De Pauw: Climate data

Harold Bockelman: Net blotch data

Ken Street: FIGS project leader

Michael Mackay: FIGS coordinator

Dag Endresen: Data analysis
