Genome data management


Shabeer Ismaeel, MSc IT, II Semester, Department of Information Technology

Contents

• Biological Sciences
• Genetics
• Characteristics of Biological Data
• What is Bioinformatics?
• Human Genome and availability of information
• Existing Biological Databases
• Various Branches Benefited

Biological Sciences

– The biological sciences encompass an enormous variety of information.

• Environmental science gives us a view of how species live and interact in a world filled with natural phenomena.
• Biology and ecology study particular species.
• Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies.
• Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole.
• Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell.

– This wealth of information, which has been generated, classified, and stored for centuries, has only recently become a major application of database technology.

Genetics

• Genetics has emerged as an ideal field for the application of information technology.
– In a broad sense, it can be thought of as the construction of models based on information about genes and populations and the seeking out of relationships in that information.
• Genes can be defined as units of heredity.

– The study of genetics can be divided into three branches:
• Mendelian genetics is the study of the transmission of traits between generations.
• Molecular genetics is the study of the chemical structure and function of genes at the molecular level.
• Population genetics is the study of how genetic information varies across populations of organisms.

The origins of molecular genetics can be traced to two important discoveries:
– In 1869, Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent research, DNA and a related compound, ribonucleic acid (RNA), were found to be composed of nucleotides (a sugar, a phosphate, and a base combining to form nucleic acid) linked into long polymers via the sugar and phosphate.
– The second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the molecular substance carrying genetic information.

Genes were shown to be composed of chains of nucleic acids arranged linearly on chromosomes and to serve three primary functions:
– Replicating genetic information between generations,
– Providing blueprints for the creation of polypeptides, and
– Accumulating changes, thereby allowing evolution to occur.

Watson and Crick found the double-helix structure of DNA in 1953, which gave molecular biology a new direction.

Characteristics of Biological Data

• Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem.

• The field that deals with managing and analyzing biological information is called bioinformatics.

What is Bioinformatics?

• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline.
• The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be detected.
• There are three important sub-disciplines within bioinformatics:

1. The development of new algorithms and statistics with which to assess relationships among members of large data sets.

2. The analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures.

3. The development and implementation of tools that enable efficient access and management of different types of information.

Biological Data + Computer Calculations = Bioinformatics

The Bioinformatics Spectrum

Various characteristics

• Biological data is highly complex when compared with most other domains or applications.
• The amount and range of variability in data is high.
• Schemas in biological databases change at a rapid pace.
• Representations of the same data by different biologists will likely be different (even using the same system).
• Most users of biological data do not require write access to the database; read-only access is adequate.
• Most biologists are not likely to have knowledge of the internal structure of the database or about schema design.
• The context of data gives added meaning for its use in biological applications.
• Defining and representing complex queries is extremely important to the biologist.
• Users of biological information often require access to "old" values of the data, particularly when verifying previously reported results.
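
The need for "old" values is often met by keeping every revision of a record instead of overwriting it. The following is a minimal sketch of that idea, not a description of how any database in these slides works: the `VersionedStore` class, its method names, and the sample record are all hypothetical.

```python
from bisect import bisect_right


class VersionedStore:
    """Append-only store: every update is kept, so old values stay readable."""

    def __init__(self):
        # key -> list of version numbers (ascending) and a parallel list of values
        self._versions = {}
        self._values = {}

    def put(self, key, version, value):
        """Record a new value; versions for a key must be strictly increasing."""
        versions = self._versions.setdefault(key, [])
        if versions and version <= versions[-1]:
            raise ValueError("versions must increase")
        versions.append(version)
        self._values.setdefault(key, []).append(value)

    def get(self, key, as_of=None):
        """Return the latest value, or the value current at version `as_of`."""
        versions = self._versions.get(key, [])
        if not versions:
            raise KeyError(key)
        if as_of is None:
            return self._values[key][-1]
        idx = bisect_right(versions, as_of)
        if idx == 0:
            raise KeyError(f"{key} has no value at or before version {as_of}")
        return self._values[key][idx - 1]


store = VersionedStore()
store.put("seq-001", 1, "ATGGCC")  # hypothetical record identifier and data
store.put("seq-001", 3, "ATGGCA")  # later correction of the same record
```

Calling `store.get("seq-001", as_of=2)` returns the value that was current when an earlier result was reported, while `store.get("seq-001")` returns the latest one.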

What is the Human Genome?

– The term genome is defined as the total genetic information that can be obtained about an entity. For example, the human genome generally refers to the complete set of genes required to create a human being.

– The number of genes is estimated to be more than 30,000, spread over 23 pairs of chromosomes, with an estimated 3 to 4 billion nucleotides.

– The goal of the Human Genome Project (HGP), which began in 1990, is to obtain the complete sequence (the ordering of the bases) of those nucleotides.

Existing Biological Databases

• Some of the existing database systems that are supporting or have grown out of the Human Genome Project include:

• GenBank
– The preeminent DNA sequence database in the world today is GenBank, maintained by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM).
– It was established in 1978 as a central repository for DNA sequence data.
– Since then it has expanded to include sequence tag data, protein sequence data, three-dimensional protein structure, taxonomy, and links to the medical literature (MEDLINE).

– GenBank contains over 31 billion nucleotide bases of more than 24 million sequences from over 100,000 species, with roughly 1,400 new organisms being added each month.

– The database size in flat-file format is over 100 GB uncompressed and has been doubling every 15 months.

– The system is maintained as a combination of flat files, relational databases, and files containing Abstract Syntax Notation One (ASN.1), a notation with rules for encoding and decoding data.
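
A size that doubles every 15 months grows exponentially. As a rough illustration only, the sketch below projects the flat-file size as size(t) = start × 2^(t / 15), with t in months. The function name and the assumption that the doubling period stays constant are choices made for this example, not figures reported by GenBank.

```python
def projected_size_gb(months, start_gb=100.0, doubling_months=15.0):
    """Project database size assuming a constant doubling period."""
    return start_gb * 2 ** (months / doubling_months)


for months in (0, 15, 30, 60):
    print(f"after {months:>2} months: about {projected_size_gb(months):,.0f} GB")
```

Under that assumption, five years (60 months) corresponds to four doublings, or a sixteenfold increase over the starting size.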

• The Genome Database (GDB)
– Created in 1989, GDB is a catalog of human gene mapping data. Gene mapping is a process that associates a piece of information with a particular location on the human genome.
– The GDB system is built around Sybase, a commercial relational DBMS, and its data are modeled using standard Entity-Relationship techniques.
– GDB distributes a Database Access Toolkit.
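
To make the idea of mapping data in a relational model concrete, here is a small sketch that associates labeled items with locations on a chromosome and queries a region. It is an illustration under stated assumptions, not the GDB schema: the table names, columns, labels, and coordinates are invented, and an in-memory SQLite database stands in for Sybase.

```python
import sqlite3

# In-memory SQLite stands in for a relational DBMS; GDB itself used Sybase.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chromosome (
        name TEXT PRIMARY KEY            -- e.g. '1', 'X'
    );
    CREATE TABLE map_entry (             -- hypothetical entity: a mapped item
        id         INTEGER PRIMARY KEY,
        label      TEXT NOT NULL,        -- made-up label, not a real gene symbol
        chromosome TEXT NOT NULL REFERENCES chromosome(name),
        start_pos  INTEGER NOT NULL,     -- made-up coordinates
        end_pos    INTEGER NOT NULL
    );
    """
)
conn.executemany("INSERT INTO chromosome VALUES (?)", [("1",), ("X",)])
conn.executemany(
    "INSERT INTO map_entry (label, chromosome, start_pos, end_pos) "
    "VALUES (?, ?, ?, ?)",
    [("item-A", "1", 1000, 5000), ("item-B", "1", 7000, 9000),
     ("item-C", "X", 200, 800)],
)


def entries_in_region(chromosome, start, end):
    """Return labels of entries overlapping [start, end] on a chromosome."""
    rows = conn.execute(
        "SELECT label FROM map_entry "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return [label for (label,) in rows]
```

A call such as `entries_in_region("1", 4000, 7500)` returns the labels whose intervals overlap that region, which is the kind of location-based lookup a mapping catalog needs to support.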

• Online Mendelian Inheritance in Man (OMIM)
– Online Mendelian Inheritance in Man (OMIM) is an electronic collection of information on the genetic basis of human disease.
– In 1991 its administration was transferred from Johns Hopkins University to the NCBI (National Center for Biotechnology Information), and the entire database was converted to NCBI's GenBank format. Today it contains more than 14,000 entries.

• EcoCyc
– The Encyclopedia of Escherichia coli Genes and Metabolism (EcoCyc) is a recent experiment in combining information about the genome and the metabolism of E. coli K-12 (a bacterium).
– The database was created in 1996 as a collaboration between Stanford Research Institute and the Marine Biological Laboratory.

• Gene Ontology
– The Gene Ontology (GO) Consortium was formed in 1998 as a collaboration among three model organism databases: FlyBase, Mouse Genome Informatics (MGI), and the Saccharomyces (yeast) Genome Database (SGD).
– The goal is to produce a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism.
– The latest release of the GO database has over 13,000 terms and more than 18,000 relationships between terms.
– GO was implemented using MySQL, an open source relational database, and a monthly database release is available in SQL and XML (Extensible Markup Language) formats.
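
A controlled vocabulary with relationships between terms can be pictured as a graph in which each term points to more general terms. The sketch below is a toy model of that structure, assuming simple parent links; the `EX:` identifiers, the term names in the comments, and the `ancestors` helper are invented for illustration and are not real GO terms or GO software.

```python
# Toy vocabulary: each term lists its direct, more general parent terms.
# All identifiers and names are invented for illustration.
PARENTS = {
    "EX:0001": [],                      # root, e.g. "process"
    "EX:0002": ["EX:0001"],             # e.g. "metabolism"
    "EX:0003": ["EX:0001"],             # e.g. "transport"
    "EX:0004": ["EX:0002", "EX:0003"],  # a term with two parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


# An item annotated with EX:0004 is implicitly described by its ancestors too.
print(sorted(ancestors("EX:0004")))
```

Storing only the direct links and deriving the rest keeps the vocabulary compact while still letting a query for a general term find items annotated with more specific ones.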

Summary of the Major Genome-Related Databases

Various Branches Benefited

• Medicine
• Pharmacogenomics
• Biotechnology
• Bioinformatics
• Proteomics

Engineering
