Loading Data into GERMINATE
How data are loaded into the GERMINATE tables.
Loading Molecular Marker Data
DATA Ac1 Ac2 Ac3 Ac4 Ac5
M1 + - 0 + +
M2 + + + + +
M3 +/- + + + +
M4 - - - 0 -
M5 + - - +/- +
M6 0 - - - +
M7 - - - +/- -
M8 + - - + +
M9 - - - - -
M10 + + + +/- -
Box A: Accession metadataset [dataset 1, dimension 0]

Accession Order
1  Ac1
2  Ac2
3  Ac3
4  Ac4
5  Ac5

Accessions
accession_id  accenumb  instcode_id  ..
81            Ac1       76           ..
86            Ac2       76           ..
92            Ac3       76           ..
45            Ac4       76           ..
63            Ac5       76           ..

ReferenceData (dataset 2)
dataset_id  index_id  reference_id  table_id
2           1         81            5
2           2         86            5
2           3         92            5
2           4         45            5
2           5         63            5
table_id 5 -> Accessions table; reference_id = accession_id

Box B: Marker metadataset [dataset 1, dimension 1]

StringData (dataset 3)
dataset_id  index_id  string data
3           1         M1
3           2         M2
3           3         M3
3           4         M4
3           5         M5
3           6         M6
3           7         M7
3           8         M8
3           9         M9
3           10        M10

Box C: Allele data (dataset 1)

Unique Allelic States
+
-
0
+/-

Data Order
1   +
2   -
3   0
4   +
...
48  +
49  +/-
50  -

EnumUnits-ArraysText
unit_id  enum_index  text[ ]
7        1           [+]
7        2           [-]
7        3           [0]
7        4           [+,-]

AlleleIndex
enum_index  allele index array
1           [1]
2           [2]
3           [3]
4           [1,2]

IntegerData
dataset_id  index_id  integer_data (enum_index)
1           1         1
1           2         2
1           3         3
1           4         1
...
1           48        1
1           49        4
1           50        2

Box D

Datasets
dataset_id  method_id  experiment_id  data_type_id  dimension count  dataset_discription
1           1          1              2             2                data
2           1          1              1                              Accession array for data
3           1          1              4                              Marker array for data

Metadatasets
metadataset_id  dimension  dataset_id  size
2               0          1           5
3               1          1           10
Box D shows the metadata recorded in the database that is required to recreate the dataset: the number of dimensions of the dataset and the links that relate each metadataset to it.
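As a rough illustration of why this metadata is enough to recreate a dataset, the sketch below rebuilds a marker-by-accession table from a flat, ordered list of values plus the two metadatasets. It assumes, based on the Data Order in the figure, that the accession dimension varies fastest. The function name `recreate_dataset` and the dictionary layout are hypothetical, not part of GERMINATE.

```python
# Enum lookup copied from the figure: enum_index -> allelic state.
ENUM_VALUES = {1: "+", 2: "-", 3: "0", 4: "+/-"}


def recreate_dataset(flat_codes, accessions, markers, enum_values):
    """Rebuild {(marker, accession): state} from a flat, ordered list of codes.

    Hypothetical helper: assumes accessions (dimension 0) vary fastest.
    """
    n_accessions = len(accessions)
    if len(flat_codes) != n_accessions * len(markers):
        raise ValueError("flat data length does not match the metadataset sizes")
    table = {}
    for position, code in enumerate(flat_codes):
        marker = markers[position // n_accessions]       # dimension 1
        accession = accessions[position % n_accessions]  # dimension 0
        table[(marker, accession)] = enum_values[code]
    return table


# First two markers (M1, M2) of the sample DATA table, already encoded.
codes = [1, 2, 3, 1, 1, 1, 1, 1, 1, 1]
rebuilt = recreate_dataset(
    codes, ["Ac1", "Ac2", "Ac3", "Ac4", "Ac5"], ["M1", "M2"], ENUM_VALUES
)
print(rebuilt[("M1", "Ac3")])  # 0
```

The sizes recorded for each metadataset (5 accessions, 10 markers in the figure) are what allow the flat index to be split back into its two dimensions.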
The DATA table is a sample of how molecular marker data are typically submitted: a set of markers analyzed in a set of accessions.
The arrows in the figure show the flow of information as it is inserted into the database. Black arrows indicate data being held temporarily, green arrows indicate insertion into the database, and blue arrows indicate that data already inserted are being used to insert information into another table. In the latter case, IDs assigned by the database are used to trace back to the original data. The colours in the tables follow the dataset and metadatasets through the process of being inserted into the database. The peach colour denotes the Accession metadataset, green denotes the Marker metadataset, and purple denotes the allele data.
Box A represents the Accession data and metadata inserted into GERMINATE. On entry, each Accession is assigned an accession_id that is unique in the database, and this ID is used to reference the appropriate accession in the accession metadataset. The order or numbering of the accession_ids has no influence on the order of accessions in the metadataset. The ReferenceData table uses a data index to track the correct order of the accession_ids.
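The role of the data index can be sketched in a few lines. This is a minimal illustration, not GERMINATE's loader: the accession_id values are copied from the figure, and the helper name `build_reference_rows` is made up for the example.

```python
# accession_id values as assigned by the database in the figure.
accession_ids = {"Ac1": 81, "Ac2": 86, "Ac3": 92, "Ac4": 45, "Ac5": 63}

ACCESSIONS_TABLE_ID = 5  # in the figure, table_id 5 points at the Accessions table


def build_reference_rows(dataset_id, ordered_names, id_lookup, table_id):
    """Return (dataset_id, index_id, reference_id, table_id) rows.

    Hypothetical helper: index_id records the submission order, so the
    database-assigned IDs can be arbitrary numbers.
    """
    return [
        (dataset_id, index_id, id_lookup[name], table_id)
        for index_id, name in enumerate(ordered_names, start=1)
    ]


reference_rows = build_reference_rows(
    2, ["Ac1", "Ac2", "Ac3", "Ac4", "Ac5"], accession_ids, ACCESSIONS_TABLE_ID
)
for row in reference_rows:
    print(row)
```

Note that Ac4 receives the smaller ID 45 yet stays in fourth position, because the order is carried by the index and not by the ID.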
Box B indicates where the marker information is inserted into the database, again retaining the order of the original dataset through the data index value.
Box C demonstrates how the allelic state of each accession-by-marker combination is translated into an integer ID (enum_index). This ID is stored in the appropriate order in the IntegerData table. The enum_index can then be used to translate back to the actual allele value, or to an allele index if a query only requires the relative allele states between accessions. The AlleleIndex table was created to speed up queries where the technology is unimportant and the relative allele values suffice to answer the question.
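The translation can be sketched as follows, using the sample DATA table. Two things here are assumptions for illustration only: enum_index values are handed out in order of first appearance (which happens to reproduce the figure), and the allele index array is derived by splitting a state such as "+/-" on "/". The function names are hypothetical.

```python
# Sample DATA table, one row per marker (M1..M10), one column per accession.
DATA = [
    ["+", "-", "0", "+", "+"],    # M1
    ["+", "+", "+", "+", "+"],    # M2
    ["+/-", "+", "+", "+", "+"],  # M3
    ["-", "-", "-", "0", "-"],    # M4
    ["+", "-", "-", "+/-", "+"],  # M5
    ["0", "-", "-", "-", "+"],    # M6
    ["-", "-", "-", "+/-", "-"],  # M7
    ["+", "-", "-", "+", "+"],    # M8
    ["-", "-", "-", "-", "-"],    # M9
    ["+", "+", "+", "+/-", "-"],  # M10
]


def encode_states(matrix, dataset_id=1):
    """Return (state -> enum_index, [(dataset_id, index_id, enum_index), ...])."""
    enum_index = {}
    integer_rows = []
    index_id = 0
    for marker_row in matrix:
        for state in marker_row:
            index_id += 1
            # Assumption: a new state gets the next free integer.
            code = enum_index.setdefault(state, len(enum_index) + 1)
            integer_rows.append((dataset_id, index_id, code))
    return enum_index, integer_rows


def allele_index(enum_index):
    """Map each enum_index to the list of single-allele indices it contains.

    Assumption: every component of a combined state also occurs on its own.
    """
    return {
        code: [enum_index[allele] for allele in state.split("/")]
        for state, code in enum_index.items()
    }


enum_index, integer_rows = encode_states(DATA)
print(integer_rows[48])          # (1, 49, 4)
print(allele_index(enum_index))  # {1: [1], 2: [2], 3: [3], 4: [1, 2]}
```

Comparing small integers or index arrays avoids string comparison on the allele text, which is the speed-up the AlleleIndex table is described as providing.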
Genetic Map Data
• 3 sets of data
  – Population data
    • Stored in the Pedigree table, with references to individuals in the reference table, which links the population to the dataset.
  – Data used to create the linkage map
    • Stored similarly to genetic data
  – Genetic linkage map data
Original Data
Locus  Linkage Group  Position
L1     1              0.0
L2     1              9.7
L3     1              12.5
L4     1              17.5
L5     2              0.0
L6     2              5.2
L7     2              9.8
L8     2              14.2
L9     2              18.3
L10    2              23.5
L11    3              0.0
L12    3              2.1
L13    3              7.5
L14    3              15.6
L15    3              19.7
L16    3              25.6

Real Data: Positions
dataset_id  index_id  real_data
1           1         0.0
1           2         9.7
1           3         12.5
1           4         17.5
1           5         0.0
1           6         5.2
1           x         #
1           16        25.6

String Data: Loci
dataset_id  index_id  string_data
2           1         L1
2           2         L2
2           3         L3
2           4         L4
2           5         L5
2           6         L6
2           x         name
2           16        L16

String Data: Linkage Groups
dataset_id  index_id  string_id
3           1         1
3           2         1
3           3         1
3           4         1
3           5         2
3           6         2
3           x         #
3           16        3

Datasets
dataset_id  method_id  experiment_id  data_type_id  dimension count  dataset_discription
1           1          1              3             2                position
2           1          1              4                              loci
3           1          1              4                              Linkage group

Metadatasets
metadataset_id  dimension  dataset_id  size
2               0          1           16
3               1          1           16
The positions of the loci in cM (the unit is indicated by the method, not shown here) form the primary dataset. The Linkage Groups and Loci are added as metadatasets for this dataset. Any additional information users wish to store can be added as further dimensions of the dataset. The primary dataset is then linked, through the linking table, to the populations and genetic data used to create the maps. The grey boxes are database-assigned IDs.
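A minimal sketch of this split, assuming the dataset IDs shown in the figure (1 for positions, 2 for loci, 3 for linkage groups) and a shared index that keeps the three datasets aligned. The function name `split_map` is hypothetical, and storing the linkage group as text simply follows the "String Data" label in the figure.

```python
# First rows of the original map table: (locus, linkage group, position in cM).
map_rows = [
    ("L1", 1, 0.0),
    ("L2", 1, 9.7),
    ("L3", 1, 12.5),
    ("L4", 1, 17.5),
    ("L5", 2, 0.0),
    ("L6", 2, 5.2),
]


def split_map(rows, position_ds=1, loci_ds=2, group_ds=3):
    """Split map rows into a primary dataset and two metadatasets.

    Hypothetical helper: all three outputs share index_id, so row i of each
    describes the same locus.
    """
    positions, loci, groups = [], [], []
    for index_id, (locus, group, position) in enumerate(rows, start=1):
        positions.append((position_ds, index_id, position))  # Real Data
        loci.append((loci_ds, index_id, locus))              # String Data
        groups.append((group_ds, index_id, str(group)))      # String Data
    return positions, loci, groups


positions, loci, groups = split_map(map_rows)
print(positions[1], loci[1], groups[1])
```

An extra per-locus attribute would be handled the same way: one more list keyed by the same index, registered as a further metadataset.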
Trait Data

Original Data
Method      M-0002  M-0002  M-002
Experiment  E-0008  E-0139  E-0142
CGN02905 9=
CGN02906 9=
CGN02907 1=
CGN03353 9
CGN10142 1
CGN10143 9
CGN10163 9
CGN10165 9
CGN10166 9
CGN10167 9
CGN10168 9
CGN10169 9

Experiments
experiment_id  name    date        author_id  description
1              E-0008  1985-01-01  1
2              E-0139  1988-01-01  1
3              E-0142  1989-01-01  1

Methods
method_id  name   unit_id  description
1          M-002  1

Datasets
dataset_id  method_id  experiment_id  data_type_id  description
1           1          1              1             Data for E-0008, M-0002
2           1          1              2             Accessions for E-0008, M-0002
3           1          2              1             Data for E-0139, M-0002
4           1          2              2             Accessions for E-0139, M-0002
5           1          3              1             Data for E-0142, M-0002
6           1          3              2             Accessions for E-0142, M-0002

Units
unit_id  name         abbreviation  description
1        State:M-002  State:M-002

Metadatasets
metadataset_id  dimension  dataset_id  size
2               0          1           4
4               0          3           3
6               0          5           5
All of this trait data uses the same method, but three different experiments were carried out. Each experiment therefore has two datasets: the data values and the accessions. The colors follow the loading of each experiment into the database. The IDs (method_id, dataset_id, etc.) are assigned by the database.
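The bookkeeping described here can be sketched as below. The counter that hands out dataset IDs only simulates what the database does on insert, and the function name `plan_datasets` is hypothetical. The data_type_id values (1 for data, 2 for accessions) and the accession counts are taken from the tables on this slide.

```python
# (experiment_id, name, number of accessions scored) from the slide.
experiments = [(1, "E-0008", 4), (2, "E-0139", 3), (3, "E-0142", 5)]


def plan_datasets(experiments, method_id=1, method_name="M-0002"):
    """Create two dataset rows and one metadataset row per experiment.

    Hypothetical helper: next_id stands in for database-assigned IDs.
    """
    datasets, metadatasets = [], []
    next_id = 1
    for experiment_id, name, n_accessions in experiments:
        data_id, accession_id = next_id, next_id + 1
        next_id += 2
        datasets.append(
            (data_id, method_id, experiment_id, 1, f"Data for {name}, {method_name}")
        )
        datasets.append(
            (accession_id, method_id, experiment_id, 2,
             f"Accessions for {name}, {method_name}")
        )
        # The accession dataset is dimension 0 of the data dataset.
        metadatasets.append((accession_id, 0, data_id, n_accessions))
    return datasets, metadatasets


datasets, metadatasets = plan_datasets(experiments)
print(metadatasets)  # [(2, 0, 1, 4), (4, 0, 3, 3), (6, 0, 5, 5)]
```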
Trait data

EnumUnits
unit_id  enum_index  enum_value
1        1           9=
1        2           1=
1        3           9
1        4           1

ReferenceData
dataset_id  data_index  reference_id  table_id
2           1           1025          4
2           2           1026          4
2           3           1029          4
2           4           1030          4
4           1           2584          4
4           2           1028          4
4           3           1035          4
6           1           1045          4
6           2           1046          4
6           3           2685          4
6           4           1050          4
6           5           1051          4

IntegerData
dataset_id  data_index  integer_data
1           1           1
1           2           1
1           3           2
1           4           3
3           1           3
3           2           4
3           3           4
5           1           4
5           2           4
5           3           4
5           4           4
5           5           4

AlleleIndex
enum_index  allele_index
1           [1]
2           [2]
3           [3]
4           [4]
The data values are translated to integers using the EnumUnits table, and the integers are loaded into the database. This is done because, for large datasets, searching an integer table is faster than searching a string table. The reference_id here corresponds to the ID of the accession in the original data entry.
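Translating the stored integers back to trait values is then a join against EnumUnits. The sketch below uses an in-memory SQLite database purely as a stand-in. The column types, the function name `decode_dataset`, and passing unit_id directly (instead of resolving it through the Datasets and Methods tables) are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EnumUnits (unit_id INTEGER, enum_index INTEGER, enum_value TEXT);
    CREATE TABLE IntegerData (dataset_id INTEGER, data_index INTEGER, integer_data INTEGER);
    """
)
# Rows copied from the slide (EnumUnits for unit 1, IntegerData for dataset 1).
conn.executemany(
    "INSERT INTO EnumUnits VALUES (?, ?, ?)",
    [(1, 1, "9="), (1, 2, "1="), (1, 3, "9"), (1, 4, "1")],
)
conn.executemany(
    "INSERT INTO IntegerData VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 1), (1, 3, 2), (1, 4, 3)],
)


def decode_dataset(conn, dataset_id, unit_id):
    """Return [(data_index, enum_value), ...] for one dataset, in data order."""
    query = """
        SELECT d.data_index, e.enum_value
        FROM IntegerData AS d
        JOIN EnumUnits AS e ON e.enum_index = d.integer_data
        WHERE d.dataset_id = ? AND e.unit_id = ?
        ORDER BY d.data_index
    """
    return conn.execute(query, (dataset_id, unit_id)).fetchall()


decoded = decode_dataset(conn, 1, 1)
print(decoded)  # [(1, '9='), (2, '9='), (3, '1='), (4, '9')]
```

Filtering on integer_data (for example, all rows with enum_index 3) never touches the text column, which is the benefit the slide describes for large datasets.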