1/5/11
Announcements
• Lab hour/discussion section:
  – next Thursday 2-3pm, 93 Hutchison
• Assignment 1
  – Getting started with Python & SQLite
  – Out: end of this week, due: end of next week (tbd)
• Readings:
  – Two papers posted on website; please read!
  – Send questions to mailing list!
• Visit site, subscribe to mailing list (= group)
  – sites.google.com/site/ecs166wq11
    • New reading materials there, additional online book, etc.
  – groups.google.com/group/ecs166wq11
ECS-166
Topics
• Introduction (SDM, e-Science, scientific workflows)
• Databases
  – Basic concepts of relational databases (crash course)
  – Metadata, ontologies
  – Hands-on exercises with SQLite, MySQL, Python
• Scientific Workflows
  – Basic concepts
  – Hands-on exercises with Kepler, Taverna, Python, …
  – Data Provenance (theory & examples)
  – Parallel Execution (Map-Reduce)
The 4th Paradigm
Taverna workflow
• Chapter 3, page 137 in the 4th Paradigm book.
• The workflow connects internationally distributed datasets to identify candidate genes that could be implicated in resistance to African trypanosomiasis (sleeping sickness)
[Figure 1 (The Fourth Paradigm, p. 139): A Taverna workflow that connects several internationally distributed datasets to identify candidate genes that could be implicated in resistance to African trypanosomiasis [11]. Legend: workflow inputs, workflow outputs, input ports, output ports, local services, Beanshell scripts, Soaplab services, string constants, BioMart services.]
Li Weng et al. Genome Res. 2006
Microbial Ecology, Metagenomics: what microbes are in my favorite environment?
STAP (ss-rRNA Taxonomy Assigning Pipeline): D. Wu, A. L. Hartman, N. Ward, J. A. Eisen, PLoS ONE, June 2008
WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)
[Figure: WATERS workflow diagram. Boxes shown: Assembled contigs; Chimera check (Mallard); Profile alignment (STAP or Infernal); Find OTUs (OTUHunter); Assign taxonomy (STAP); Build phylogenetic tree (RAxML or Quicktree); View tree: Dendroscope; UniFrac: tree & environment file; Diversity statistics (text: OTU list, Chao1, Shannon; graphs: rarefaction curves, rank-abundance curves); Visualization tools: Cytoscape networks & heatmap. Annotations: Metadata, +/- CIPRES, +/- cluster.]
Executable WATERS Workflow in Kepler
myExperiment.org
myExperiment allows users to find, use and share scientific workflows and other Research Objects, and to build communities.
Simple Kepler analysis workflow using
Data source from EcoGrid (metadata-driven ingestion)
res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)
R processing script
Dan Higgins, NCEAS
Scientific Workflow for Phylogenetic Analysis
SciWF ~ executable spec of a scientific data analysis method
• Actors
• Channels, ports
• Tokens: int, string, record{..}, array[..], ..
[Figure: example workflow with actors Clustal, Quicktree, and DrawTree, and data items AA-Sequences, Aligned AA-Sequences, and Newick Tree]
From “Climate Gate” to Reproducible Science
Provenance and Scientific Workflows
prov•e•nance noun: place of origin; derivation
For the scientist (focus on data derivation)
– Evaluate results based on actors and data used, parameter settings, etc.
– Automate metadata creation
– Maintain a record of what was done within a project, etc.
– Provide a high-level view of what a workflow did, dependencies, etc.
For the engineer (focus on processing history)
– Monitor, benchmark, and optimize workflow performance
– Record resources used during workflow execution
– Checkpoint and restart workflows
– Optimization (e.g., minimize unnecessary recomputations)
Provenance questions a scientist might ask …
• Which DNA sequences were input to the workflow?
• Which phylogenetic trees were created?
• Which actor created this phylogenetic tree?
• Which input sequences did this tree depend on?
• What input sequences were not used to derive any output consensus trees?
• What sequence alignment was used to infer this tree?
• Which actors were involved in creating this tree?
How can we answer these questions?
• By recording what happens during the workflow run …
  – We often call the result a workflow execution trace
  – What is recorded depends on what can be “observed” during the run
  – What can be observed depends on the model of computation (MoC)
  – Sometimes the MoC isn’t enough … and the observables must be augmented to capture provenance
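The idea of recording an execution trace and then querying it can be sketched in a few lines of Python. This is only a toy illustration, not how Kepler or any other workflow system records provenance: the `Trace` class, its methods, and the stub actors `align` and `infer_tree` are all hypothetical names, and the "observables" are simply the name, inputs, and output of each actor invocation.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Toy execution trace: one (actor, inputs, output) event per invocation."""
    events: list = field(default_factory=list)

    def run(self, actor, *inputs):
        # Invoke the actor and record what was observed about the invocation.
        output = actor(*inputs)
        self.events.append((actor.__name__, inputs, output))
        return output

    def created_by(self, item):
        # Which actor created this item? None means it was a workflow input.
        for name, _inputs, output in self.events:
            if output == item:
                return name
        return None

    def lineage(self, item):
        # All data items that the given item transitively depends on.
        deps = set()
        for _name, inputs, output in self.events:
            if output == item:
                for i in inputs:
                    deps.add(i)
                    deps |= self.lineage(i)
        return deps


# Stub actors (assumptions): placeholders, not real alignment or tree tools.
def align(*seqs):
    return "aln(" + ",".join(seqs) + ")"


def infer_tree(alignment):
    return "tree(" + alignment + ")"


trace = Trace()
aln = trace.run(align, "seq1", "seq2")
tree = trace.run(infer_tree, aln)

print(trace.created_by(tree))       # which actor created this tree?
print(sorted(trace.lineage(tree)))  # which data did this tree depend on?
```

With a trace like this, "which input sequences were not used" becomes a set difference between the workflow inputs and the lineage of the outputs.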
Provenance (Data Lineage) Graphs
• Scientific workflows: to specify and execute computational pipelines
• Provenance information captures data lineage and processing history
• Workflow (Kepler/COMAD)
• Provenance
Scientific Workflows & Data Mining: Kepler/WEKA
Introduction to Data(base) Management
Why study data(base) management?
– Critical to business, government, science, culture, society, …
– Determines success of many corporations (even their existence)
– Many tech companies built on data management (Google, Amazon, Yahoo!, Facebook, …)
– … or offer database products (Microsoft, IBM, Oracle)
– Database systems span major areas of computer science
  • Operating systems (file, memory, process management)
  • Theory (languages, algorithms, complexity)
  • Artificial Intelligence (knowledge-based systems, logic, search)
  • Software Engineering (application development)
  • Data structures (trees, hash tables)
  • … and the DB research community continues to be very active
Databases are everywhere (“Every-Ware”)
Regularly Structured Data
The structure is set once (e.g., table attributes), and many instances (records) then use that structure
• Examples of regularly structured data
  – Employee, payroll, bank account
  – Data captured on web forms
• Examples of unstructured (a.k.a. loosely or “semi-structured”) data
  – Documents, (heaps of) video, audio, images, maps, …
We Focus on Regularly Structured Data
We focus on relational database management systems (abbreviated: DBMS or RDBMS)
– Mainly designed to store, manage, and retrieve structured data
– We use SQL to manage and retrieve (query) data from databases (abbreviated: DB)
Unstructured data (e.g., documents) is managed mainly by content management and information retrieval systems
– Includes search engines on the web
– Querying involves indexing words in text, ranking results, etc.
– Includes “Web 2.0” features like tagging/labeling
* Many DBMSs now support unstructured and semi-structured data too
Some Characteristics of Data in Databases
Data is persistent
– One or more applications use the same data
– Data stored between applications
Data often too large to easily manage in-memory
– DBMSs handle this for free
– Manually handling data (files) is usually ad hoc (each app. does it differently) and can be inefficient
Data may be very large (business, government, science, …)
– Library of Congress: > 20 terabytes of print
– Amazon.com: > 42 terabytes of data
– YouTube: > 45 terabytes of video
– AT&T: > 323 terabytes of call records
– National Energy Research Scientific Computing Center: > 2.8 petabytes
* 1 terabyte ≈ 1,000,000,000,000 bytes
* 1 petabyte ≈ 1,000,000,000,000,000 bytes (and there is talk about exabytes at DOE)
Lots of Data Everywhere
• From http://en.wikipedia.org/wiki/Petabyte :
• History: According to Kevin Kelly in The New York Times, "the entire [written] works of humankind, from the beginning of recorded history, in all languages" would amount to 50 petabytes of data.[1]
• Computer hardware: Teradata Database 12 has a capacity of 50 petabytes of compressed data.[2][3]
• Telecoms: AT&T has about 16 petabytes of data transferred through their networks each day.[4]
• Archives: The Internet Archive contains about 3 petabytes of data, and is growing at the rate of about 100 terabytes per month as of March, 2009.[5][6]
• Internet: Google processes about 20 petabytes of data per day.[7]
• Physics: The 4 experiments in the Large Hadron Collider will produce about 15 petabytes of data per year, which will be distributed over the LHC Computing Grid.[8]
• P2P networks: As of October 2009, Isohunt has about 9.76 petabytes of files contained in torrents indexed globally.[9]
• Games: World of Warcraft utilizes 1.3 petabytes of storage to maintain its game.[10]
What is a DB?
A database (DB) is a (structured) collection of persistent data
– NB (the picky guy): DB schema vs. DB instance
A database management system (DBMS) is a software system that supports the definition, population, and query of a database
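The three DBMS tasks named here (definition, population, query) can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the `reading` table, its columns, and its two rows are made-up placeholders, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Definition: declare the schema (hypothetical table, for illustration only).
conn.execute(
    "CREATE TABLE reading (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER)"
)

# Population: add rows; the current set of rows is the instance.
conn.executemany(
    "INSERT INTO reading (id, title, pages) VALUES (?, ?, ?)",
    [(1, "Paper A", 12), (2, "Paper B", 30)],
)

# Query: retrieve the rows that satisfy a condition.
long_titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM reading WHERE pages > 20 ORDER BY id"
    )
]
print(long_titles)  # ['Paper B']
```

The `CREATE TABLE` statement fixes the schema once, while the rows inserted afterwards make up the instance, which can change over time.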
Basic Database Architecture
[Figure: DBMS architecture diagram. Components shown: Application Front Ends, Web Forms, SQL Interface; SQL Commands; Query Evaluation Engine (Parser, Optimizer, Plan Executor, Operator Evaluator); File and Access Methods; Buffer Manager; Disk Space Manager; Transaction Manager; Lock Manager; Concurrency Control; Recovery Manager; System Catalog; Index Files; Data Files. The diagram appears on two slides, annotated “Query Processing” and “Query Execution” respectively.]
Computer Science in a Nutshell …
“All computer science students must learn to integrate theory and practice, to recognize the importance of abstraction, and to appreciate the value of good engineering design.”
– Final report of the Joint ACM/IEEE-CS Task Force on Computing Curricula 2005 for Computer Science
This is one of the really fun things about studying database systems!!!
Computer Science in a Nutshell … this course
Practice
• Practical concepts
• Skills
• Tools
Theory
• Formal definitions
• Mathematical results
Engineering
• Performance tradeoffs
• Scalability
• Reliability
Annotations: Strong Emphasis; Important Formalizations; Only a Bit; Focus of the DB research community
Formalization may not exactly match practical concept (often the core, e.g., SQL vs. Relational Algebra)
Introduction to Relational Databases
• Assume this table has been defined to keep track of bank accounts
  – Also referred to as a “relation”

Account
Number  Owner       Balance   Type
101     J. Smith     1000.00  checking
102     W. Wei       2000.00  checking
103     J. Smith     5000.00  savings
104     M. Jones     1000.00  checking
105     H. Martin   10000.00  checking
Relational Database Terminology
• “Account” is the name of the table (relation)
• “Number”, “Owner”, “Balance”, and “Type” are the names of the “attributes” (columns)
Relational Database Terminology
• The schema sets the structure of the table
• The schema is the definition of the table
  – Which generally includes more than what is shown here
  – E.g., data types and constraints
• Here, the “schema” of the table is Account(Number, Owner, Balance, Type)
Relational Database Terminology
• Each entry in the table is called a “row”, “tuple”, or “record” (often used interchangeably)
• The “instance” of the schema is the current set of rows
• In the Account table, each line of values (e.g., 101, J. Smith, 1000.00, checking) is a row; the rows currently in the table form the instance
Relational Database Terminology
• The “intension” of the table is its schema; the current “extension” (or extent) of the table is its current set of rows
• These terms are not used as often in relational databases; they appear mainly in deductive (logic-based) and object-oriented databases

Account
Number  Owner       Balance   Type
101     J. Smith     1000.00  checking
102     W. Wei       2000.00  checking
104     M. Jones     1000.00  checking
105     H. Martin   10000.00  checking
107     W. Yu        7500.00  savings
109     R. Jones      432.55  checking
Relational Database Terminology
• “Degree” or “arity” of a table is the number of attributes
• “Cardinality” of a table is the number of rows in the current instance
• Arity of this relation is 4 (because there are 4 attributes)
• Cardinality of this instance is 6 (because there are 6 rows)
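Arity and cardinality can also be checked mechanically. The following sketch loads the six-row Account instance into an in-memory SQLite database and reads the arity from the schema and the cardinality from the instance. The column types (`INTEGER`, `TEXT`, `REAL`) are assumptions made for this example; the slides do not state them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed for illustration.
conn.execute(
    "CREATE TABLE Account (Number INTEGER, Owner TEXT, Balance REAL, Type TEXT)"
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?, ?, ?)",
    [
        (101, "J. Smith", 1000.00, "checking"),
        (102, "W. Wei", 2000.00, "checking"),
        (104, "M. Jones", 1000.00, "checking"),
        (105, "H. Martin", 10000.00, "checking"),
        (107, "W. Yu", 7500.00, "savings"),
        (109, "R. Jones", 432.55, "checking"),
    ],
)

# Arity is a property of the schema: one table_info row per attribute.
arity = len(conn.execute("PRAGMA table_info(Account)").fetchall())

# Cardinality is a property of the current instance: the number of rows.
cardinality = conn.execute("SELECT COUNT(*) FROM Account").fetchone()[0]

print(arity, cardinality)  # 4 6
```

Inserting or deleting rows changes the cardinality but leaves the arity untouched, which mirrors the distinction between instance and schema.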
Relational Database Terminology
Account
Number  Owner       Balance   Type
101     J. Smith     1000.00  checking
102     W. Wei       2000.00  checking
103     J. Smith     5000.00  savings
104     M. Jones     1000.00  checking
105     H. Martin   10000.00  checking

Deposit
Account  Transaction-id  Date      Amount
102      1               10/22/09    500.00
102      2               10/29/09    200.00
104      3               10/29/09   1000.00
105      4               11/2/09   10000.00

Check
Account  Check-number  Date      Amount
101      924           10/23/09  125.00
101      925           10/24/09   23.98
Relational Database Terminology
• Each table (typically) has a “key”
• The values of the key must be unique
What is the key for the Check table?
Relational Database Terminology
• A “key” consists of one or more attributes
• We often underline key attributes
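A key made of more than one attribute can be declared as a composite primary key. The sketch below assumes that the key of the Check table is the pair (Account, Check-number), and shows SQLite rejecting a second row with the same key values. The column types and the underscore spelling `Check_number` are assumptions for this example; `Check` is quoted because it is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Check" is an SQL keyword, so the table name is written in double quotes.
conn.execute(
    """
    CREATE TABLE "Check" (
        Account      INTEGER,
        Check_number INTEGER,
        Date         TEXT,
        Amount       REAL,
        PRIMARY KEY (Account, Check_number)  -- assumed composite key
    )
    """
)
conn.executemany(
    'INSERT INTO "Check" VALUES (?, ?, ?, ?)',
    [
        (101, 924, "10/23/09", 125.00),
        (101, 925, "10/24/09", 23.98),
    ],
)

# A second check numbered 924 on account 101 would violate key uniqueness.
try:
    conn.execute(
        'INSERT INTO "Check" VALUES (?, ?, ?, ?)',
        (101, 924, "10/25/09", 10.00),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

Note that neither Account nor Check-number is unique on its own in this design; only the combination is.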
Relational Database Terminology
Deposit
Account  Transaction-id  Date      Amount
102      1               10/22/09    500.00
102      2               10/29/09    200.00
104      3               10/29/09   1000.00
105      4               11/2/09   10000.00
106      5               12/5/09     555.00

Is this legal, given that the Account table contains only accounts 101 to 105?
If not, how do we prevent it from happening?
Relational Database Terminology
• We say that Deposit.Account is a “foreign key” that references Account.Number
  – i.e., each Deposit (row) must refer to an Account (row)
• If the DBMS enforces this constraint, we have “referential integrity”
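Referential integrity can be observed directly in SQLite, with one caveat: SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` has been issued on the connection. The sketch below declares Deposit.Account as a foreign key and shows that a deposit for the non-existent account 106 is rejected. Column types, the reduced Account table (only Number and Owner), and the choice of Transaction_id as primary key are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE Account (Number INTEGER PRIMARY KEY, Owner TEXT)")
conn.execute(
    """
    CREATE TABLE Deposit (
        Account        INTEGER REFERENCES Account(Number),
        Transaction_id INTEGER PRIMARY KEY,  -- assumed key
        Date           TEXT,
        Amount         REAL
    )
    """
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?)",
    [
        (101, "J. Smith"),
        (102, "W. Wei"),
        (103, "J. Smith"),
        (104, "M. Jones"),
        (105, "H. Martin"),
    ],
)

# Account 102 exists, so this deposit is accepted.
conn.execute("INSERT INTO Deposit VALUES (?, ?, ?, ?)", (102, 1, "10/22/09", 500.00))

# Account 106 does not exist, so the DBMS refuses the dangling reference.
try:
    conn.execute("INSERT INTO Deposit VALUES (?, ?, ?, ?)", (106, 5, "12/5/09", 555.00))
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

print(dangling_rejected)  # True
```

Without the pragma, the same insert would succeed silently, which is exactly the situation in which referential integrity is not guaranteed.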
Relational Database Terminology
• Are there any foreign keys in the Check table?
Yes, Check.Account is a foreign key that references Account.Number
Relational Database Terminology
• Foreign keys may or may not be part of the key for the table
  – Deposit.Account is not part of the key for Deposit
  – Check.Account is part of the key for Check
Relational Database Terminology
• Consider the following sample data from a table

Salesperson  Company  Age  Commission
1            Jones    28   $50,000
2            Smith    28   $60,000

Can you tell what the key for this table is?
Relational Database Terminology
• Consider the following sample data from a table

Salesperson  Company  Age  Commission
1            Jones    28   $50,000
2            Smith    28   $60,000

Now can you tell what the key for the table is?
Relational Database Terminology
• One possibility: Person table with Id as the key

Id  Name   Age  Salary
1   Jones  28   $50,000
2   Smith  28   $60,000
Relational Database Terminology
• Another possibility: Sales Commission table, by client company, per day

Salesperson  Company  Day  Commission
1            Jones    28   $50,000
2            Smith    28   $60,000

Keys, table names, attribute names (help to) tell us what the table is …
(… the meaning, or “semantics”, of the relation => a form of metadata)
Relational Database Terminology
• For every attribute of every table, the schema specifies allowable values. For example,
  – Number must be an integer
  – Owner must be a 30-character string
  – Type must be “checking” or “savings”
• The set of allowed values for an attribute is called the “domain” of the attribute

Account
Number  Owner     Balance  Type
101     J. Smith  1000.00  checking
102     W. Wei    2000.00  checking
…
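One way to approximate attribute domains in SQLite is with `CHECK` constraints, since SQLite's declared column types are only loosely enforced by default. The sketch below is an assumption-laden illustration: it reads "30-character string" as "at most 30 characters", and the helper `accepts` is a hypothetical function that reports whether a row satisfies all domain constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Account (
        Number  INTEGER CHECK (typeof(Number) = 'integer'),
        Owner   TEXT    CHECK (length(Owner) <= 30),
        Balance REAL,
        Type    TEXT    CHECK (Type IN ('checking', 'savings'))
    )
    """
)


def accepts(row):
    """Return True if the row satisfies every domain constraint."""
    try:
        conn.execute("INSERT INTO Account VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(accepts((101, "J. Smith", 1000.00, "checking")))   # True
print(accepts((102, "W. Wei", 2000.00, "brokerage")))    # False: Type outside its domain
print(accepts(("abc", "W. Wei", 2000.00, "checking")))   # False: Number is not an integer
```

Other DBMSs enforce declared types more strictly, so the `typeof` check is specific to this SQLite sketch.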
Specification of a Relational Database Schema
• Select the tables, with a name for each table
  – A database schema may have multiple tables
  – Each table has its own schema
• Select attributes for each table and give the domain for each attribute
  – This is the basis of a relation (or table) schema
• … also: Specify the key(s) for each table
  – There can be more than one key for a table
  – There is only one primary key (more on this later)
• Specify all appropriate foreign keys
Specification of a Relational Database Schema
• Another example database
  – More standard notation; each table has one primary key

Teacher(Number, Name, Office, E-mail)
Course(Number, Name, Description)
Class-Offering(Quarter, Course, Section, Teacher, TimeDay)
Student(Number, Name, Major, Advisor)
Completed(Student, Course, Quarter, Section, Grade)
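A schema written in this notation maps fairly directly onto SQL `CREATE TABLE` statements. The sketch below is one possible translation, not the course's official solution: the primary keys and foreign keys marked "assumed" are guesses, because the key underlining of the original slides is not available here, and the column types and the underscore spelling of `Class_Offering` and `Email` are likewise choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce references
conn.executescript(
    """
    CREATE TABLE Teacher (Number INTEGER PRIMARY KEY, Name TEXT, Office TEXT, Email TEXT);
    CREATE TABLE Course  (Number INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
    CREATE TABLE Student (
        Number  INTEGER PRIMARY KEY,
        Name    TEXT,
        Major   TEXT,
        Advisor INTEGER REFERENCES Teacher(Number)   -- assumed foreign key
    );
    CREATE TABLE Class_Offering (
        Quarter TEXT,
        Course  INTEGER REFERENCES Course(Number),   -- assumed foreign key
        Section INTEGER,
        Teacher INTEGER REFERENCES Teacher(Number),  -- assumed foreign key
        TimeDay TEXT,
        PRIMARY KEY (Quarter, Course, Section)       -- assumed key
    );
    CREATE TABLE Completed (
        Student INTEGER REFERENCES Student(Number),  -- assumed foreign key
        Course  INTEGER,
        Quarter TEXT,
        Section INTEGER,
        Grade   TEXT,
        PRIMARY KEY (Student, Course, Quarter, Section),  -- assumed key
        FOREIGN KEY (Quarter, Course, Section)
            REFERENCES Class_Offering (Quarter, Course, Section)
    );
    """
)

# List the tables that now exist in the database schema.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Writing the schema out this way makes the design questions concrete: every `REFERENCES` clause is a foreign key that someone had to decide on, and every omitted one is a constraint the DBMS will not check.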
Specification of a Relational Database Schema
• Another example database
  – with some foreign keys shown informally

Teacher(Number, Name, Office, E-mail)
Course(Number, Name, Description)
Class-Offering(Quarter, Course, Section, Teacher, TimeDay)
Student(Number, Name, Major, Advisor)
Completed(Student, Course, Quarter, Section, Grade)

What foreign keys are missing?
What are the limitations of this schema?