Loading Data into GERMINATE

How data are loaded into the GERMINATE tables.

Loading Molecular Marker Data

DATA (as submitted)

DATA  Ac1  Ac2  Ac3  Ac4  Ac5
M1    +    -    0    +    +
M2    +    +    +    +    +
M3    +/-  +    +    +    +
M4    -    -    -    0    -
M5    +    -    -    +/-  +
M6    0    -    -    -    +
M7    -    -    -    +/-  -
M8    +    -    -    +    +
M9    -    -    -    -    -
M10   +    +    +    +/-  -

Box A

Accessions

accession_id  accenumb  instcode_id
81            Ac1       76           ..
86            Ac2       76           ..
92            Ac3       76           ..
45            Ac4       76           ..
63            Ac5       76           ..

Accession Order

1  Ac1
2  Ac2
3  Ac3
4  Ac4
5  Ac5

ReferenceData: Accession metadataset (dataset 2) [dataset 1, dimension 0]

dataset_id  index_id  reference_id  table_id
2           1         81            5
2           2         86            5
2           3         92            5
2           4         45            5
2           5         63            5

table_id 5 -> Accessions table; reference_id = accession_id

Box B

StringData: Marker metadataset (dataset 3) [dataset 1, dimension 1]

dataset_id  index_id  string_data
3           1         M1
3           2         M2
3           3         M3
3           4         M4
3           5         M5
3           6         M6
3           7         M7
3           8         M8
3           9         M9
3           10        M10

Box C

Unique Allelic States: +, -, 0, +/-

Data Order

1   +
2   -
3   0
4   +
...
48  +
49  +/-
50  -

EnumUnits-Arrays

unit_id  enum_index  text[ ]
7        1           [+]
7        2           [-]
7        3           [0]
7        4           [+,-]

AlleleIndex

enum_index  allele index array
1           [1]
2           [2]
3           [3]
4           [1,2]

IntegerData (dataset 1)

dataset_id  index_id  integer_data (enum_index)
1           1         1
1           2         2
1           3         3
1           4         1
...
1           48        1
1           49        4
1           50        2

Box D

Datasets

dataset_id  method_id  experiment_id  data_type_id  dimension count  dataset_discription
1           1          1              2             2                data
2           1          1              1                              Accession array for data
3           1          1              4                              Marker array for data

Metadatasets

metadataset_id  dimension  dataset_id  size
2               0          1           5
3               1          1           10

Box D displays the metadata recorded in the database that are required to recreate the dataset. These include the number of dimensions of the dataset and the relationship of each metadataset to the dataset.
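As a minimal sketch of how the Metadatasets sizes could be used to find a cell again in the flat data, the following computes the data index from a position in each dimension. The function name `data_index` is hypothetical, and the assumption that dimension 0 (accessions) varies fastest is inferred from the Data Order list in the figure, not stated by the source.

```python
# Metadatasets rows as shown in the figure:
# (metadataset_id, dimension, dataset_id, size)
METADATASETS = [(2, 0, 1, 5), (3, 1, 1, 10)]


def data_index(positions, metadatasets=METADATASETS):
    """Return the 1-based flat index for 1-based positions per dimension.

    positions[d] is the position along dimension d. Dimension 0 is assumed
    to vary fastest, which matches the Data Order list in the figure.
    """
    sizes = [row[3] for row in sorted(metadatasets, key=lambda row: row[1])]
    if len(positions) != len(sizes):
        raise ValueError("one position per dimension is required")
    index, stride = 0, 1
    for position, size in zip(positions, sizes):
        if not 1 <= position <= size:
            raise IndexError(f"position {position} outside 1..{size}")
        index += (position - 1) * stride
        stride *= size
    return index + 1


# Accession Ac4 (position 4) for marker M10 (position 10)
print(data_index((4, 10)))
```

For Ac4 and M10 this gives index 49, the entry listed as +/- in the Data Order of the figure.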

The DATA table is a sample of how molecular marker data are typically submitted: a set of markers analyzed in a set of accessions.

The arrows in the figure show the flow of information as it is inserted into the database. Black arrows indicate that data are being held temporarily, green arrows indicate insertion into the database, and blue arrows indicate that data already inserted are being used to insert information into another table. In the latter case, IDs assigned by the database are used to trace back to the original data. The colours in the tables follow the dataset and metadatasets through the process of being inserted into the database. The peach colour denotes the Accession metadataset, green denotes the Marker metadataset, and purple denotes the allele data.

Box A represents the Accession data and metadata inserted into GERMINATE. On entry, each Accession is assigned an accession_id that is unique in the database, and this ID is used to reference the appropriate accession in the accession metadataset. The order or magnitude of the accession_id values has no influence on the order of accessions in the metadataset. The ReferenceData table uses a data index to track the correct order of the accession_id values.
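The ordering described for Box A can be sketched as follows. This is an illustration only: the helper `reference_rows` is hypothetical, and the accession_id values and table_id 5 are simply copied from the figure.

```python
# table_id for the Accessions table, as shown in the figure
ACCESSIONS_TABLE_ID = 5


def reference_rows(dataset_id, accession_ids, table_id=ACCESSIONS_TABLE_ID):
    """Build (dataset_id, index_id, reference_id, table_id) rows.

    index_id is 1-based and follows the submitted order of the accessions,
    independent of the database-assigned accession_id values.
    """
    return [
        (dataset_id, index_id, accession_id, table_id)
        for index_id, accession_id in enumerate(accession_ids, start=1)
    ]


# accession_id values assigned to Ac1..Ac5 in the figure
rows = reference_rows(2, [81, 86, 92, 45, 63])
for row in rows:
    print(row)
```

Note that Ac4 keeps index 4 even though its accession_id (45) is the smallest of the five.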

Box B indicates where the marker information is inserted into the database, again preserving the order of the original dataset through the data index value.

Box C demonstrates how the allelic state of each accession-by-marker combination is translated into an integer ID (enum_index). This ID is stored in the appropriate order in the IntegerData table. The enum_index can then be used to translate back to the actual allele value, or to an allele index if a query requires only the relative allele states between accessions. The AlleleIndex table was created to speed up queries where the technology is unimportant and the relative allele values suffice to answer the question.
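A minimal sketch of the translation in Box C, using the DATA table from the figure. The function `encode` is hypothetical; it assumes enum_index values are handed out in order of first appearance and that the matrix is flattened marker by marker, both of which agree with the figure but are not confirmed by the source. The allele index construction further assumes that every allele in a combined state (such as +/-) also occurs as a state on its own.

```python
# Marker-by-accession matrix from the DATA table (rows M1..M10, columns Ac1..Ac5)
DATA = [
    ["+", "-", "0", "+", "+"],
    ["+", "+", "+", "+", "+"],
    ["+/-", "+", "+", "+", "+"],
    ["-", "-", "-", "0", "-"],
    ["+", "-", "-", "+/-", "+"],
    ["0", "-", "-", "-", "+"],
    ["-", "-", "-", "+/-", "-"],
    ["+", "-", "-", "+", "+"],
    ["-", "-", "-", "-", "-"],
    ["+", "+", "+", "+/-", "-"],
]


def encode(matrix, dataset_id=1):
    """Return the state -> enum_index map and the IntegerData rows."""
    enum_index = {}
    integer_rows = []
    flat = (state for row in matrix for state in row)
    for index_id, state in enumerate(flat, start=1):
        # New states get the next free enum_index; known states reuse theirs.
        code = enum_index.setdefault(state, len(enum_index) + 1)
        integer_rows.append((dataset_id, index_id, code))
    return enum_index, integer_rows


enum_index, integer_rows = encode(DATA)

# enum_index -> list of allele indices, e.g. "+/-" becomes [1, 2]
allele_index = {
    code: [enum_index[allele] for allele in state.split("/")]
    for state, code in enum_index.items()
}

# Reverse lookup from enum_index back to the allele value
decode = {code: state for state, code in enum_index.items()}
print(decode[integer_rows[48][2]])
```

The last line looks up data index 49 and translates its enum_index back to the allele value.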

Genetic Map Data

• 3 sets of data
  – Population data
    • Stored in the Pedigree table, with references to individuals in the reference table, which links the population to the dataset.
  – Data used to create the linkage map
    • Stored similarly to genetic data
  – Genetic linkage map data

Original Data

Locus  Linkage Group  Position
L1     1              0.0
L2     1              9.7
L3     1              12.5
L4     1              17.5
L5     2              0.0
L6     2              5.2
L7     2              9.8
L8     2              14.2
L9     2              18.3
L10    2              23.5
L11    3              0.0
L12    3              2.1
L13    3              7.5
L14    3              15.6
L15    3              19.7
L16    3              25.6

Real Data: Positions

dataset_id  index_id  real_data
1           1         0.0
1           2         9.7
1           3         12.5
1           4         17.5
1           5         0.0
1           6         5.2
1           x         #
1           16        25.6

String Data: Linkage Groups

dataset_id  index_id  string_id
3           1         1
3           2         1
3           3         1
3           4         1
3           5         2
3           6         2
3           x         #
3           16        3

String Data: Loci

dataset_id  index_id  string_data
2           1         L1
2           2         L2
2           3         L3
2           4         L4
2           5         L5
2           6         L6
2           x         name
2           16        L16

Datasets

dataset_id  method_id  experiment_id  data_type_id  dimension count  dataset_discription
1           1          1              3             2                position
2           1          1              4                              loci
3           1          1              4                              Linkage group

Metadatasets

metadataset_id  dimension  dataset_id  size
2               0          1           16
3               1          1           16

The positions of the loci in cM (the unit is indicated by the method, not shown here) form the primary dataset. The Linkage Groups and Loci are added as metadatasets for this dataset. Any additional information users wish to store can be added as further dimensions of the dataset. The primary dataset is then linked, through the linking table, to the populations and genetic data used to create the maps. The grey boxes are database-assigned IDs.
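The split of a submitted map into one primary dataset and two metadatasets might be sketched like this. The function `split_map` is hypothetical, the dataset IDs 1, 2 and 3 are copied from the figure, and only the first six of the sixteen loci are included to keep the example short.

```python
# First six rows of the original map data: (locus, linkage group, position in cM)
ORIGINAL = [
    ("L1", 1, 0.0),
    ("L2", 1, 9.7),
    ("L3", 1, 12.5),
    ("L4", 1, 17.5),
    ("L5", 2, 0.0),
    ("L6", 2, 5.2),
]


def split_map(rows, position_ds=1, loci_ds=2, group_ds=3):
    """Split map rows into RealData and StringData style rows.

    Every output row is (dataset_id, index_id, value); the shared index_id
    keeps the position, locus name and linkage group of a locus together.
    """
    real_data, loci, groups = [], [], []
    for index_id, (locus, group, position) in enumerate(rows, start=1):
        real_data.append((position_ds, index_id, position))
        loci.append((loci_ds, index_id, locus))
        groups.append((group_ds, index_id, str(group)))
    return real_data, loci, groups


real_data, loci, groups = split_map(ORIGINAL)
for position_row, locus_row, group_row in zip(real_data, loci, groups):
    print(position_row, locus_row, group_row)
```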

Trait Data

Original Data

Method:      M-0002  M-0002  M-002
Experiment:  E-0008  E-0139  E-0142

CGN02905  9=
CGN02906  9=
CGN02907  1=
CGN03353  9
CGN10142  1
CGN10143  9
CGN10163  9
CGN10165  9
CGN10166  9
CGN10167  9
CGN10168  9
CGN10169  9

Experiments

experiment_id  name    date        author_id  description
1              E-0008  1985-01-01  1
2              E-0139  1988-01-01  1
3              E-0142  1989-01-01  1

Methods

method_id  name   unit_id  description
1          M-002  1

Units

unit_id  name         abbreviation  description
1        State:M-002  State:M-002

Datasets

dataset_id  method_id  experiment_id  data_type_id  description
1           1          1              1             Data for E-0008, M-0002
2           1          1              2             Accessions for E-0008, M-0002
3           1          2              1             Data for E-00139, M-0002
5           1          3              1             Data for E-00142, M-0002

Metadatasets

metadataset_id  dimension  dataset_id  size
2               0          1           4
4               0          3           3
6               0          5           5

These trait data all use the same method but come from three different experiments. Each experiment therefore has two datasets: the data values and the accessions. The colours follow the loading of each experiment into the database. The IDs (method_id, dataset_id, etc.) are assigned by the database.
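The pairing of a data dataset with an accession dataset per experiment can be sketched as below. The function `plan_datasets` is hypothetical, and the counter only imitates the IDs that the database would assign; the experiment sizes 4, 3 and 5 are taken from the Metadatasets table in the figure.

```python
from itertools import count


def plan_datasets(experiment_sizes):
    """Plan two datasets per experiment: data values and accessions.

    experiment_sizes is a list of (experiment_id, number_of_accessions).
    Returns dataset rows (dataset_id, experiment_id, role) and metadataset
    rows (metadataset_id, dimension, dataset_id, size).
    """
    next_id = count(1)  # stand-in for database-assigned dataset IDs
    datasets, metadatasets = [], []
    for experiment_id, size in experiment_sizes:
        data_id = next(next_id)
        accession_id = next(next_id)
        datasets.append((data_id, experiment_id, "data"))
        datasets.append((accession_id, experiment_id, "accessions"))
        # The accession dataset describes dimension 0 of the data dataset.
        metadatasets.append((accession_id, 0, data_id, size))
    return datasets, metadatasets


datasets, metadatasets = plan_datasets([(1, 4), (2, 3), (3, 5)])
for row in metadatasets:
    print(row)
```

With these inputs the metadataset rows match the three rows shown in the figure.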

Trait data

EnumUnits

unit_id  enum_index  enum_value
1        1           9=
1        2           1=
1        3           9
1        4           1

ReferenceData

dataset_id  data_index  reference_id  table_id
2           1           1025          4
2           2           1026          4
2           3           1029          4
2           4           1030          4
4           1           2584          4
4           2           1028          4
4           3           1035          4
6           1           1045          4
6           2           1046          4
6           3           2685          4
6           4           1050          4
6           5           1051          4

IntegerData

dataset_id  data_index  integer_data
1           1           1
1           2           1
1           3           2
1           4           3
3           1           3
3           2           4
3           3           4
5           1           4
5           2           4
5           3           4
5           4           4
5           5           4

AlleleIndex

enum_index  allele_index
1           [1]
2           [2]
3           [3]
4           [4]

The data values are translated to integers using the EnumUnits table, and the integers are loaded into the database. This is done because, for large datasets, searching an integer table is faster than searching a string table. The reference_id here corresponds to the ID of the accession in the original data entry.
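To illustrate the reverse direction, the sketch below joins the IntegerData and ReferenceData rows of the first experiment on their data index and translates the integers back through EnumUnits. The function `decode` and the in-memory dictionaries are assumptions for illustration; in the database this would be a query over the corresponding tables.

```python
# EnumUnits: (unit_id, enum_index) -> enum_value, as in the figure
ENUM_UNITS = {(1, 1): "9=", (1, 2): "1=", (1, 3): "9", (1, 4): "1"}

# IntegerData rows for dataset 1: (dataset_id, data_index, integer_data)
INTEGER_DATA = [(1, 1, 1), (1, 2, 1), (1, 3, 2), (1, 4, 3)]

# ReferenceData rows for dataset 2: (dataset_id, data_index, reference_id, table_id)
REFERENCE_DATA = [(2, 1, 1025, 4), (2, 2, 1026, 4), (2, 3, 1029, 4), (2, 4, 1030, 4)]


def decode(integer_data, reference_data, enum_units, unit_id):
    """Return (reference_id, enum_value) pairs in data index order."""
    accession_at = {
        data_index: reference_id
        for _, data_index, reference_id, _ in reference_data
    }
    return [
        (accession_at[data_index], enum_units[(unit_id, value)])
        for _, data_index, value in integer_data
    ]


decoded = decode(INTEGER_DATA, REFERENCE_DATA, ENUM_UNITS, unit_id=1)
for reference_id, value in decoded:
    print(reference_id, value)
```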
