
1/5/11 ludaesch/ecs166wq11/166-02.pdf


1/5/11

Announcements

•  Lab hour/discussion section:
–  next Thursday 2-3pm, 93 Hutchison

•  Assignment-1
–  Getting started with Python & SQLite
–  Out: end of this week, due: end of next week (tbd)

•  Readings:
–  Two papers posted on web site – please read!
–  Send questions to mailing list!

•  Visit site, subscribe to mailing list (= group)
–  sites.google.com/site/ecs166wq11
   •  New reading materials there, additional online book, etc.
–  groups.google.com/group/ecs166wq11

ECS-166

Topics

•  Introduction (SDM, e-Science, scientific workflows)

•  Databases
–  Basic concepts of relational databases (crash course)
–  Metadata, ontologies
–  Hands-on exercises with SQLite, MySQL, Python

•  Scientific Workflows
–  Basic concepts
–  Hands-on exercises with Kepler, Taverna, Python, …
–  Data Provenance (theory & examples)
–  Parallel Execution (Map-Reduce)

The 4th Paradigm

Taverna workflow

•  Chapter 3, page 137 in the 4th Paradigm book.

•  The workflow connects internationally distributed datasets to identify candidate genes that could be implicated in resistance to African trypanosomiasis (sleeping sickness)

FIGURE 1 (The Fourth Paradigm, p. 139). A Taverna workflow that connects several internationally distributed datasets to identify candidate genes that could be implicated in resistance to African trypanosomiasis [11].

[Workflow diagram: Get_pathways. Legend: workflow inputs, workflow outputs, input port, output port, local service, Beanshell, Soaplab service, string constant, Biomart service.]

Microbial Ecology, Metagenomics: what microbes are in my favorite environment?

Li Weng et al. Genome Res. 2006

STAP (ss-rRNA Taxonomy Assigning Pipeline)
D. Wu, A.L. Hartman, N. Ward, J.A. Eisen, PLoS ONE, June 2008

WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)

[Pipeline diagram. Boxes shown:]
•  Find OTUs (OTUHunter)
•  Assign Taxonomy (STAP)
•  Profile alignment (STAP or Infernal)
•  Build phylogenetic tree (RaxML or Quicktree)
•  View tree: Dendroscope
•  UniFrac: tree & environment file
•  Assembled contigs
•  Chimera check (Mallard)
•  Diversity statistics
–  Text: OTU list, Chao1, Shannon
–  Graphs: rarefaction curves, rank-abundance curves
•  Visualization tools: Cytoscape networks & Heat map
•  Annotations on the boxes: Metadata, +/- cipres, +/- cluster


Executable WATERS Workflow in Kepler

myExperiment.org

myExperiment allows users to find, use and share scientific workflows and other Research Objects, and to build communities.

Simple Kepler analysis workflow using

•  Data source from EcoGrid (metadata-driven ingestion)
•  R processing script:

res <- lm(BARO ~ T_AIR)   # linear regression of BARO on T_AIR
res                       # print the fitted model
plot(T_AIR, BARO)         # scatter plot of the data
abline(res)               # overlay the fitted line

Dan Higgins, NCEAS

Scientific Workflow for Phylogenetic Analysis

SciWF ~ executable spec of a scientific data analysis method

•  Actors
•  Channels, Ports
•  Tokens: int, string, record{..}, array[..], ..

Example pipeline: AA-Sequences → Clustal → Aligned AA-Sequences → Quicktree → Newick Tree → DrawTree
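The actor/channel/token vocabulary can be sketched in a few lines of Python. This is a toy illustration, not Kepler's API: `align`, `build_tree`, and `draw_tree` are hypothetical stand-ins for the Clustal, Quicktree, and DrawTree actors and do no real bioinformatics. Generators play the role of channels, and each yielded value is a token.

```python
from typing import Callable, Iterable, Iterator, List

# An actor is modeled as a function from an input token stream to an
# output token stream (an assumption made for this sketch).
Actor = Callable[[Iterator], Iterator]

def align(batches: Iterator[List[str]]) -> Iterator[List[str]]:
    """Stand-in for Clustal: pad every sequence to equal length with '-'."""
    for batch in batches:
        width = max(len(seq) for seq in batch)
        yield [seq.ljust(width, "-") for seq in batch]

def build_tree(alignments: Iterator[List[str]]) -> Iterator[str]:
    """Stand-in for Quicktree: emit a flat (star) tree in Newick syntax."""
    for aligned in alignments:
        yield "(" + ",".join(f"s{i}" for i in range(len(aligned))) + ");"

def draw_tree(trees: Iterator[str]) -> Iterator[str]:
    """Stand-in for DrawTree: produce a textual rendering of each tree."""
    for tree in trees:
        yield f"tree: {tree}"

def run_workflow(source: Iterable, actors: List[Actor]) -> list:
    """Connect the actors in a linear pipeline and drain the last channel."""
    stream = iter(source)
    for actor in actors:
        stream = actor(stream)
    return list(stream)

result = run_workflow([["MKV", "MKVLA", "MK"]], [align, build_tree, draw_tree])
print(result)
```

Real workflow systems add typed ports, branching dataflow, and a scheduler on top of this basic picture.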

From “Climate Gate” to Reproducible Science

Provenance and Scientific Workflows

prov•e•nance noun: place of origin; derivation

For the scientist (focus on data derivation)
–  Evaluate results based on actors and data used, parameter settings, etc.
–  Automate metadata creation
–  Maintain a record of what was done within a project, etc.
–  Provide a high-level view of what a workflow did, dependencies, etc.

For the engineer (focus on processing history)
–  Monitor, benchmark, and optimize workflow performance
–  Record resources used during workflow execution
–  Checkpoint and restart workflows
–  Optimization (e.g., minimize unnecessary recomputations)

Provenance questions a scientist might ask …

•  Which DNA sequences were input to the workflow?
•  Which phylogenetic trees were created?
•  Which actor created this phylogenetic tree?
•  Which input sequences did this tree depend on?
•  What input sequences were not used to derive any output consensus trees?
•  What sequence alignment was used to infer this tree?
•  Which actors were involved in creating this tree?

How can we answer these questions?

•  By recording what happens during the workflow run …
–  We often call the result a workflow execution trace
–  What is recorded depends on what can be “observed” during the run
–  What can be observed depends on the model of computation (MoC)
–  Sometimes the MoC isn’t enough … and the observables must be augmented to capture provenance
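A minimal sketch of what "recording what happens" might look like, assuming the only observable events are "actor X read these items and wrote that item". The `Trace` class and the item names (`seq1`, `aln1`, `tree1`) are hypothetical and not taken from Kepler or any provenance standard.

```python
from collections import defaultdict

class Trace:
    """Workflow execution trace: who produced each item, and from what."""

    def __init__(self):
        self.produced_by = {}               # data id -> actor name
        self.inputs_of = defaultdict(set)   # data id -> ids it was derived from

    def record(self, actor, inputs, output):
        """Log one observed firing of an actor."""
        self.produced_by[output] = actor
        self.inputs_of[output].update(inputs)

    def lineage(self, item):
        """All data items that `item` transitively depends on."""
        seen, stack = set(), [item]
        while stack:
            for parent in self.inputs_of.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def actors_involved(self, item):
        """Actors that produced `item` or anything in its lineage."""
        items = self.lineage(item) | {item}
        return {self.produced_by[i] for i in items if i in self.produced_by}

trace = Trace()
trace.record("Clustal", ["seq1", "seq2"], "aln1")
trace.record("Quicktree", ["aln1"], "tree1")

workflow_inputs = {"seq1", "seq2", "seq3"}
used = trace.lineage("tree1") & workflow_inputs   # inputs the tree depends on
unused = workflow_inputs - used                   # inputs not used for any output
print(used, unused, trace.actors_involved("tree1"))
```

Several of the questions above become graph reachability queries over this record; whether the trace is fine-grained enough to answer them depends on what the model of computation lets us observe.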

Provenance (Data Lineage) Graphs

•  Scientific workflows: to specify and execute computational pipelines
•  Provenance information captures data lineage and processing history

•  Workflow (Kepler/COMAD)

•  Provenance

Scientific Workflows & Data Mining: Kepler/WEKA


Introduction to Data(base) Management

Why study data(base) management?

–  Critical to business, government, science, culture, society, …
–  Determines success of many corporations (even their existence)
–  Many tech companies built on data management (Google, Amazon, Yahoo!, Facebook, …)
–  … or offer database products (Microsoft, IBM, Oracle)
–  Database systems span major areas of computer science
   •  Operating systems (file, memory, process management)
   •  Theory (languages, algorithms, complexity)
   •  Artificial Intelligence (knowledge-based systems, logic, search)
   •  Software Engineering (application development)
   •  Data structures (trees, hash-tables)
   •  … and the DB research community continues to be very active

Databases are everywhere (“Every-Ware”)

Regularly Structured Data

Regularly structured data sets the structure once (e.g., table attributes) and then has many instances (records) that use that structure

•  Examples of regularly structured data
–  Employee, payroll, bank account
–  Data captured on web forms

•  Examples of unstructured (a.k.a. loosely or “semi-structured”) data
–  Documents, (heaps of) video, audio, images, maps, …

We Focus on Regularly Structured Data

We focus on relational database management systems (abbreviated: DBMS or RDBMS)
–  Mainly designed to store, manage, and retrieve structured data
–  We use SQL to manage and retrieve (query) data from databases (abbreviated: DB)

Unstructured data (e.g., documents) is managed mainly by content management and information retrieval systems
–  Includes search engines on the web
–  Querying involves indexing words in text, ranking results, etc.
–  Includes “Web 2.0” features like tagging/labeling

* Many DBMSs now support unstructured and semi-structured data too

Some Characteristics of Data in Databases

Data is persistent
–  One or more applications use the same data
–  Data is stored between application runs

Data often too large to easily manage in-memory
–  DBMSs handle this for free
–  Manually handling data (files) is usually ad hoc (each app. does it differently) and can be inefficient

Data may be very large (business, government, science, …)
–  Library of Congress: > 20 terabytes of print
–  Amazon.com: > 42 terabytes of data
–  Youtube: > 45 terabytes of video
–  AT&T: > 323 terabytes of call records
–  National Energy Research Scientific Computing Center: > 2.8 petabytes

* 1 terabyte ≈ 1,000,000,000,000 bytes
* 1 petabyte ≈ 1,000,000,000,000,000 bytes (and there is talk about exabytes at DOE)

Lots of Data Everywhere

•  From http://en.wikipedia.org/wiki/Petabyte :

•  History: According to Kevin Kelly in The New York Times, "the entire [written] works of humankind, from the beginning of recorded history, in all languages" would amount to 50 petabytes of data.[1]

•  Computer hardware: Teradata Database 12 has a capacity of 50 petabytes of compressed data.[2][3]

•  Telecoms: AT&T has about 16 petabytes of data transferred through their networks each day.[4]

•  Archives: The Internet Archive contains about 3 petabytes of data, and is growing at the rate of about 100 terabytes per month as of March, 2009.[5][6]

•  Internet: Google processes about 20 petabytes of data per day.[7]

•  Physics: The 4 experiments in the Large Hadron Collider will produce about 15 petabytes of data per year, which will be distributed over the LHC Computing Grid.[8]

•  P2P networks: As of October 2009, Isohunt has about 9.76 petabytes of files contained in torrents indexed globally.[9]

•  Games: World of Warcraft utilizes 1.3 petabytes of storage to maintain its game.[10]

What is a DB?

A database (DB) is a (structured) collection of persistent data
–  NB (the picky guy): DB schema vs. DB instance

A database management system (DBMS) is a software system that supports the definition, population, and querying of a database

Basic Database Architecture

[Architecture diagram of a DBMS. Components shown:]
•  Application Front Ends, SQL Interface, Web Forms
•  SQL Commands
•  Query Evaluation Engine: Parser, Optimizer, Plan Executor, Operator Evaluator
•  File and Access Methods, Buffer Manager, Disk Space Manager
•  Transaction Manager, Lock Manager, Concurrency Control, Recovery Manager
•  System Catalog, Index Files, Data Files

Query Processing

[Same architecture diagram, labeled “Query Execution”]
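One way to see the split between planning and execution is SQLite's `EXPLAIN QUERY PLAN`, which reports the plan chosen for a statement without running it. The sketch below uses Python's standard `sqlite3` module; the small `Account` table is an illustrative assumption, and the wording of the plan rows differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account ("
    "Number INTEGER PRIMARY KEY, Owner TEXT, Balance REAL, Type TEXT)"
)
conn.execute("INSERT INTO Account VALUES (101, 'J. Smith', 1000.00, 'checking')")

query = "SELECT Owner FROM Account WHERE Number = ?"

# Ask for the chosen plan: the statement is parsed and optimized,
# but the query itself is not evaluated.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (101,)).fetchall()
for step in plan:
    print(step[-1])   # human-readable description of the plan step

# Now actually execute the query.
rows = conn.execute(query, (101,)).fetchall()
print(rows)
```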

Computer Science in a Nutshell …

“All computer science students must learn to integrate theory and practice, to recognize the importance of abstraction, and to appreciate the value of good engineering design.”

–  Final report of the Joint ACM/IEEE-CS Task Force on Computing Curricula 2005 for Computer Science

This is one of the really fun things about studying database systems!!!

Computer Science in a Nutshell … this course

Practice
•  Practical concepts
•  Skills
•  Tools

Theory
•  Formal definitions
•  Mathematical results

Engineering
•  Performance tradeoffs
•  Scalability
•  Reliability

Diagram labels: Strong Emphasis; Important Formalizations; Only a Bit; Focus of the DB research community

Formalization may not exactly match practical concept (often the core, e.g., SQL vs. Relational Algebra)

Introduction to Relational Databases

•  Assume this table has been defined to keep track of bank accounts
–  Also referred to as a “relation”

Account
Number  Owner      Balance    Type
101     J. Smith    1000.00   checking
102     W. Wei      2000.00   checking
103     J. Smith    5000.00   savings
104     M. Jones    1000.00   checking
105     H. Martin  10000.00   checking
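The Account table can be defined, populated, and queried with Python's built-in `sqlite3` module. This is a sketch for experimentation: the column types (`INTEGER`, `TEXT`, `REAL`) are assumptions, since the slide shows only names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# Definition: create the table (column types are assumed).
conn.execute(
    "CREATE TABLE Account (Number INTEGER, Owner TEXT, Balance REAL, Type TEXT)"
)

# Population: insert the five rows shown on the slide.
accounts = [
    (101, "J. Smith", 1000.00, "checking"),
    (102, "W. Wei", 2000.00, "checking"),
    (103, "J. Smith", 5000.00, "savings"),
    (104, "M. Jones", 1000.00, "checking"),
    (105, "H. Martin", 10000.00, "checking"),
]
conn.executemany("INSERT INTO Account VALUES (?, ?, ?, ?)", accounts)

# Query: all checking accounts, and the total balance over all accounts.
checking = conn.execute(
    "SELECT Number, Owner FROM Account WHERE Type = 'checking' ORDER BY Number"
).fetchall()
total = conn.execute("SELECT SUM(Balance) FROM Account").fetchone()[0]
print(checking, total)
```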


Relational Database Terminology

•  “Account” is the name of the table (relation)
•  Number, Owner, Balance, and Type are the names of the “attributes” (columns)

Relational Database Terminology

•  The schema sets the structure of the table
•  The schema is the definition of the table
–  Which generally includes more than what is shown here
–  E.g., data types and constraints

The “schema” of the table: Account(Number, Owner, Balance, Type)

Relational Database Terminology

•  Each entry in the table is called a “row”, “tuple”, or “record” (often used interchangeably)

•  The “instance” of the schema is the current set of rows

In the Account table, each line of data is a row; the five rows together form the current instance.

Relational Database Terminology

Account
Number  Owner      Balance    Type
101     J. Smith    1000.00   checking
102     W. Wei      2000.00   checking
104     M. Jones    1000.00   checking
105     H. Martin  10000.00   checking
107     W. Yu       7500.00   savings
109     R. Jones     432.55   checking

•  The “intension” of the table is its schema (the attribute names)
•  The current “extension” (or extent) of the table is its current set of rows
•  These terms are not used as often in relational databases
–  mainly in deductive (logic-based) and object-oriented databases

Relational Database Terminology

•  “Degree” or “Arity” of a table is the number of attributes
•  “Cardinality” of a table is the number of rows in the current instance

For the Account table with six rows:
•  Arity of this relation is 4 (because there are 4 attributes)
•  Cardinality of this instance is 6 (because there are 6 rows)
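Arity and cardinality can be read off a live table. The sketch below assumes SQLite via Python's `sqlite3` module and assumed column types: `PRAGMA table_info` lists the attributes of the schema, while `COUNT(*)` counts the rows of the current instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account (Number INTEGER, Owner TEXT, Balance REAL, Type TEXT)"
)
conn.executemany("INSERT INTO Account VALUES (?, ?, ?, ?)", [
    (101, "J. Smith", 1000.00, "checking"),
    (102, "W. Wei", 2000.00, "checking"),
    (104, "M. Jones", 1000.00, "checking"),
    (105, "H. Martin", 10000.00, "checking"),
    (107, "W. Yu", 7500.00, "savings"),
    (109, "R. Jones", 432.55, "checking"),
])

def arity(table):
    # One row per attribute in the table's schema.
    return len(conn.execute(f"PRAGMA table_info({table})").fetchall())

def cardinality(table):
    # Number of rows in the current instance.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

before = (arity("Account"), cardinality("Account"))

# Changing the instance changes the cardinality but not the arity.
conn.execute("DELETE FROM Account WHERE Number = 109")
after = (arity("Account"), cardinality("Account"))
print(before, after)
```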

Relational Database Terminology

Account
Number  Owner      Balance    Type
101     J. Smith    1000.00   checking
102     W. Wei      2000.00   checking
103     J. Smith    5000.00   savings
104     M. Jones    1000.00   checking
105     H. Martin  10000.00   checking

Deposit
Account  Transaction-id  Date      Amount
102      1               10/22/09    500.00
102      2               10/29/09    200.00
104      3               10/29/09   1000.00
105      4               11/2/09   10000.00

Check
Account  Check-number  Date      Amount
101      924           10/23/09  125.00
101      925           10/24/09   23.98

Relational Database Terminology

•  Each table (typically) has a “key”

•  The values of the key must be unique

What is the key for the Check table?

Relational Database Terminology

•  A “key” consists of one or more attributes

•  We often underline key attributes

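A key made of more than one attribute can be declared as a composite `PRIMARY KEY`, and the DBMS then rejects rows that repeat a key value. The sketch below uses SQLite through Python's `sqlite3` module and assumes that (Account, Check-number) is the key of the Check table; the column types and the name `CheckNumber` are also assumptions. `Check` is quoted because CHECK is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "Check" (
        Account     INTEGER,
        CheckNumber INTEGER,
        Date        TEXT,
        Amount      REAL,
        PRIMARY KEY (Account, CheckNumber)   -- assumed composite key
    )""")

# Same account, different check numbers: both rows are accepted.
conn.executemany('INSERT INTO "Check" VALUES (?, ?, ?, ?)', [
    (101, 924, "10/23/09", 125.00),
    (101, 925, "10/24/09", 23.98),
])

# Repeating the key value (101, 924) violates uniqueness.
try:
    conn.execute('INSERT INTO "Check" VALUES (?, ?, ?, ?)',
                 (101, 924, "10/25/09", 10.00))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute('SELECT COUNT(*) FROM "Check"').fetchone()[0]
print(duplicate_rejected, row_count)
```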

Relational Database Terminology

Account
Number  Owner      Balance    Type
101     J. Smith    1000.00   checking
102     W. Wei      2000.00   checking
103     J. Smith    5000.00   savings
104     M. Jones    1000.00   checking
105     H. Martin  10000.00   checking

Deposit
Account  Transaction-id  Date      Amount
102      1               10/22/09    500.00
102      2               10/29/09    200.00
104      3               10/29/09   1000.00
105      4               11/2/09   10000.00
106      5               12/5/09     555.00

Is this legal?

If not, how do we prevent it from happening?

Relational Database Terminology

•  We say that Deposit.Account is a “foreign key” that references Account.Number
–  i.e., each Deposit (row) must refer to an Account (row)

•  If the DBMS enforces this constraint, we have “referential integrity”
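Referential integrity can be tried out in SQLite, which enforces foreign keys only when `PRAGMA foreign_keys = ON` is set on the connection. In this sketch the column types, the name `TransactionId`, and the choice of Transaction-id as the key of Deposit are assumptions; the rows are the ones from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforcement is off by default in SQLite

conn.execute("""
    CREATE TABLE Account (
        Number INTEGER PRIMARY KEY, Owner TEXT, Balance REAL, Type TEXT
    )""")
conn.execute("""
    CREATE TABLE Deposit (
        Account       INTEGER NOT NULL REFERENCES Account(Number),
        TransactionId INTEGER PRIMARY KEY,   -- assumed key of Deposit
        Date          TEXT,
        Amount        REAL
    )""")

conn.executemany("INSERT INTO Account VALUES (?, ?, ?, ?)", [
    (101, "J. Smith", 1000.00, "checking"),
    (102, "W. Wei", 2000.00, "checking"),
    (103, "J. Smith", 5000.00, "savings"),
    (104, "M. Jones", 1000.00, "checking"),
    (105, "H. Martin", 10000.00, "checking"),
])
conn.executemany("INSERT INTO Deposit VALUES (?, ?, ?, ?)", [
    (102, 1, "10/22/09", 500.00),
    (102, 2, "10/29/09", 200.00),
    (104, 3, "10/29/09", 1000.00),
    (105, 4, "11/2/09", 10000.00),
])

# A deposit into account 106, which does not exist in Account.
try:
    conn.execute("INSERT INTO Deposit VALUES (?, ?, ?, ?)",
                 (106, 5, "12/5/09", 555.00))
    dangling_accepted = True
except sqlite3.IntegrityError:
    dangling_accepted = False

deposit_count = conn.execute("SELECT COUNT(*) FROM Deposit").fetchone()[0]
print(dangling_accepted, deposit_count)
```

Without the `PRAGMA` line, SQLite would accept the deposit for account 106, which is the situation the previous slide asks about.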

Relational Database Terminology

•  Are there any foreign keys in the Check table?

Yes, Check.Account is a foreign key that references Account.Number


Relational Database Terminology

•  Foreign keys may or may not be part of the key for the table
–  Deposit.Account is not part of the key for Deposit
–  Check.Account is part of the key for Check

Relational Database Terminology

•  Consider the following sample data from a table

Salesperson  Company  Age  Commission
1            Jones    28   $50,000
2            Smith    28   $60,000

Can you tell what the key for this table is?

Now can you tell what the key for the table is?

Relational Database Terminology

•  One possibility:

Id  Name   Age  Salary
1   Jones  28   $50,000
2   Smith  28   $60,000

Person table with Id as the key

Relational Database Terminology

•  Another possibility:

Salesperson  Company  Day  Commission
1            Jones    28   $50,000
2            Smith    28   $60,000

Sales Commission table, by client company, per day

Keys, table names, attribute names (help to) tell us what the table is …
(… the meaning, or “semantics”, of the relation => a form of metadata)

Relational Database Terminology

•  For every attribute of every table, the schema specifies allowable values. For example,
–  Number must be an integer
–  Owner must be a 30-character string
–  Type must be “checking” or “savings”

•  The set of allowed values for an attribute is called the “domain” of the attribute

Account
Number  Owner     Balance   Type
101     J. Smith  1000.00   checking
102     W. Wei    2000.00   checking
…
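Domains can be approximated in SQLite with column types and `CHECK` constraints. This sketch reads "30-character string" as "at most 30 characters", which is an assumption; SQLite does not enforce declared string lengths on its own, so the limit is written as an explicit `CHECK`. The helper `try_insert` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Account (
        Number  INTEGER PRIMARY KEY,
        Owner   TEXT CHECK (length(Owner) <= 30),            -- assumed reading
        Balance REAL,
        Type    TEXT CHECK (Type IN ('checking', 'savings'))
    )""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Account VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

ok_row = try_insert((101, "J. Smith", 1000.00, "checking"))
bad_type = try_insert((102, "W. Wei", 2000.00, "brokerage"))   # outside the domain of Type
bad_owner = try_insert((103, "x" * 31, 0.00, "savings"))       # Owner longer than 30 characters
print(ok_row, bad_type, bad_owner)
```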

Specification of a Relational Database Schema

•  Select the tables, with a name for each table
–  A database schema may have multiple tables
–  Each table has its own schema

•  Select attributes for each table and give the domain for each attribute
–  This is the basis of a relation (or table) schema

•  … also: Specify the key(s) for each table
–  There can be more than one key for a table
–  There is only one primary key (more on this later)

•  Specify all appropriate foreign keys

Specification of a Relational Database Schema

•  Another example database
–  More standard notation; each table has one primary key

Teacher(Number, Name, Office, E-mail)
Course(Number, Name, Description)
Class-Offering(Quarter, Course, Section, Teacher, TimeDay)
Student(Number, Name, Major, Advisor)
Completed(Student, Course, Quarter, Section, Grade)

Specification of a Relational Database Schema

•  Another example database
–  with some foreign keys shown informally

What foreign keys are missing?


What are the limitations of this schema?
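For experimenting with these questions, the five relation schemas can be written as SQLite table definitions. This is only one possible reading: the underlining and arrows that marked keys and foreign keys on the slides did not survive, so every `PRIMARY KEY` and `REFERENCES` clause below is an assumption, as are the column types, the names `ClassOffering` and `Email`, and the made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Teacher (Number INTEGER PRIMARY KEY, Name TEXT, Office TEXT, Email TEXT);
CREATE TABLE Course  (Number INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE ClassOffering (
    Quarter TEXT,
    Course  INTEGER REFERENCES Course(Number),
    Section INTEGER,
    Teacher INTEGER REFERENCES Teacher(Number),
    TimeDay TEXT,
    PRIMARY KEY (Quarter, Course, Section)
);
CREATE TABLE Student (
    Number INTEGER PRIMARY KEY, Name TEXT, Major TEXT,
    Advisor INTEGER REFERENCES Teacher(Number)
);
CREATE TABLE Completed (
    Student INTEGER REFERENCES Student(Number),
    Course INTEGER, Quarter TEXT, Section INTEGER, Grade TEXT,
    PRIMARY KEY (Student, Course, Quarter, Section),
    FOREIGN KEY (Quarter, Course, Section)
        REFERENCES ClassOffering(Quarter, Course, Section)
);
""")

# Made-up rows, only to exercise the constraints.
conn.execute("INSERT INTO Teacher VALUES (1, 'A. Teacher', 'Room 1', 'teacher@example.org')")
conn.execute("INSERT INTO Course VALUES (166, 'Example Course', 'placeholder')")
conn.execute("INSERT INTO ClassOffering VALUES ('W11', 166, 1, 1, 'TBD')")
conn.execute("INSERT INTO Student VALUES (7, 'A. Student', 'CS', 1)")
conn.execute("INSERT INTO Completed VALUES (7, 166, 'W11', 1, 'A')")

# Section 2 was never offered, so this completion record has no offering to refer to.
try:
    conn.execute("INSERT INTO Completed VALUES (7, 166, 'W11', 2, 'B')")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False

tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables, dangling_allowed)
```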