Database Database Sree

Pharmacoinformatics

Pharmacoinformatics is an emerging field that draws upon both bioinformatics and cheminformatics. The scientific or research aspect deals with the use of technology in drug discovery, while the service aspect deals with monitoring patients who are taking a drug. The scope for jobs is essentially with companies involved in drug research and clinical research. The National Institute of Pharmaceutical Education and Research (NIPER) in Punjab appears to offer the only structured course in this area at the postgraduate and Ph.D. level. The Bioinformatics Institute of India in NOIDA, Uttar Pradesh, also claims to offer a Ph.D. in this area. Because the field is still emerging, placement prospects are not clear, and companies would probably view pharmacoinformatics on a par with cheminformatics and bioinformatics. Most pharma and biotech companies are adopting a wait-and-watch policy and do not have a full-fledged department. IBM, Sun Microsystems and Oracle are significant players in the biosystems domain.

Pharmacoinformatics
Agenda: Database design; Information management; Drug information services

Database Design:
Structure of databases; Sequence databases; Relational databases; Sequence analysis; Software resources; Sequence alignment; Database searches; Phylogenetic analysis

Fundamentals of Database Design

Agenda
Introduction and participants' needs. We will review what a database is; understand the difference between data and information; look at the purpose of a database system; learn how to select a database system; and cover database definitions and fundamental building blocks.

Agenda (2)
Database development: the first steps; quality control issues; data entry considerations.

What is a database?
A database is any organized collection of data. Some examples of databases you may encounter in your daily life are:

a telephone book; a T.V. guide; an airline reservation system; motor vehicle registration records; papers in your filing cabinet; files on your computer hard drive; banking records.

Data vs. information: What is the difference?

What is data?
Data can be defined in many ways. Information science defines data as unprocessed information.

What is information?
Information is data that have been organized and communicated in a coherent and meaningful manner. Data is converted into information, and information is converted into knowledge. Knowledge is information that has been evaluated and organized so that it can be used purposefully.

Why do we need a database?
To keep records of our clients, staff and volunteers; to keep a record of activities and interventions; to keep sales records; to develop reports; to perform research; to support longitudinal tracking.

What is the ultimate purpose of a database management system?
To transform: Data -> Information -> Knowledge -> Action

More about database definition
What is a database? Quite simply, it is an organized collection of data. A database management system (DBMS) such as Access, FileMaker, Lotus Notes, Oracle or SQL Server provides you with the software tools you need to organize that data in a flexible manner. It includes tools to add, modify or delete data from the database, ask questions (or queries) about the data stored in the database, and produce reports summarizing selected contents.

For example: Databases in Bioinformatics

Outlook contacts; Aspira Association MIS; KidTrax; GIS-GPS systems

Example: 2

What is a database?
A collection of data that is:
structured;
searchable (index) -> table of contents;
updated periodically (release) -> new edition;
cross-referenced (hyperlinks) -> links with other databases.

It also includes the associated tools (software) necessary for database access, database updating, information insertion, and information deletion.

Types of Databases

Non-relational databases
Non-relational databases place information in field categories that we create so that information is available for sorting and disseminating the way we need it. The data in a non-relational database, however, is limited to that program and cannot be extracted and applied to other software programs, or to other database files within a school or administrative system. The data can only be "copied and pasted". Example: a spreadsheet.

Relational databases
In relational databases, fields can be used in a number of ways (and can be of variable length), provided that they are linked in tables. A relational database is built on a database model that provides for logical connections among files (known as tables) by including identifying data from one table in another table.
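The idea of linking tables by including identifying data from one table in another can be sketched with Python's built-in sqlite3 module. The `drug` and `prescription` tables, their columns, and the rows below are assumptions made up for illustration; they do not come from the original slides.

```python
import sqlite3

# Hypothetical schema: the table names, column names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drug (
        drug_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    -- prescription.drug_id repeats the identifying value from drug,
    -- which is what links the two tables.
    CREATE TABLE prescription (
        prescription_id INTEGER PRIMARY KEY,
        drug_id         INTEGER NOT NULL REFERENCES drug(drug_id),
        patient         TEXT NOT NULL
    );
    INSERT INTO drug VALUES (1, 'aspirin'), (2, 'ibuprofen');
    INSERT INTO prescription VALUES
        (10, 1, 'patient A'),
        (11, 1, 'patient B'),
        (12, 2, 'patient C');
""")

# Follow the shared drug_id to combine information from both tables.
rows = conn.execute("""
    SELECT p.patient, d.name
    FROM prescription AS p
    JOIN drug AS d ON d.drug_id = p.drug_id
    ORDER BY p.prescription_id
""").fetchall()

for patient, drug_name in rows:
    print(patient, "takes", drug_name)
```

Because the drug name is stored once and referenced by its key, renaming a drug means changing a single row, which is the practical benefit over copying and pasting between flat files.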

Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. Data structures are used in almost every program or software system. Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, B-trees are particularly well suited for the implementation of databases, while compiler implementations usually use hash tables to look up identifiers.

Principle: Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by an address, which is a bit string that can itself be stored in memory and manipulated by the program. The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure.

Common data structures
Array: a systematic arrangement of objects, usually in rows and columns.
Linked list (or, more clearly, "singly linked list"): a data structure that consists of a sequence of nodes, each of which contains a reference (i.e., a link) to the next node in the sequence.
Hash table (or hash map): a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number).
Heap: a specialized tree-based data structure that satisfies the heap property (each parent's key is ordered with respect to the keys of its children).
B-tree: a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic amortized time.
Red-black tree: a type of self-balancing binary search tree, typically used to implement associative arrays; it organizes pieces of comparable data, such as text fragments or numbers.
Trie (or prefix tree): an ordered tree data structure that is used to store an associative array where the keys are usually strings.
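Several of the structures listed above are available directly in Python, so a short sketch may make the definitions more concrete. The variable names and sample values are illustrative assumptions; the linked list is hand-written because Python has no built-in singly linked list.

```python
import heapq
from dataclasses import dataclass
from typing import Optional

# Array: a Python list gives positional (indexed) access.
doses_mg = [50, 100, 250]
second = doses_mg[1]

# Hash table: a dict maps a key (a name) to its value (a phone number).
phone_book = {"Alice": "555-0100", "Bob": "555-0199"}
bob = phone_book["Bob"]

# Heap: heapq keeps a list arranged as a min-heap,
# so the smallest item is always popped first.
heap = []
for value in [7, 2, 9, 4]:
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)


# Singly linked list: each node holds a value and a link to the next node.
@dataclass
class Node:
    value: str
    next: Optional["Node"] = None


def to_list(node):
    """Walk the links from the given node and collect the values in order."""
    values = []
    while node is not None:
        values.append(node.value)
        node = node.next
    return values


head = Node("A", Node("C", Node("G")))
print(second, bob, smallest, to_list(head))
```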

Language support: Most assembly languages and some low-level languages (e.g., BCPL) generally lack support for data structures. Many high-level programming languages, and some higher-level assembly languages (e.g., MASM), on the other hand, have special syntax or other built-in support for certain data structures. Most programming languages are also supported by standard libraries that implement the most common data structures, e.g., the C++ Standard Template Library, the Java Collections Framework, and Microsoft's .NET Framework.

Sequence database: In the field of bioinformatics, a sequence database is a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae), or it can include sequences from all organisms whose DNA has been sequenced.

Example: Protein structure database. In biology, a protein structure database is a database that is modeled around the various experimentally determined protein structures. The aim of most protein structure databases is to organize and annotate the protein structures, providing the biological community access to the experimental data in a useful way.

Examples of protein structure databases include (in alphabetical order):
Database of Macromolecular Movements: describes the motions that occur in proteins and other macromolecules, particularly using movies.
JenaLib: the Jena Library of Biological Macromolecules, aimed at a better dissemination of information on three-dimensional biopolymer structures, with an emphasis on visualization and analysis.
MODBASE: a database of three-dimensional protein models calculated by comparative modeling.
OCA: a browser-database for protein structure/function. OCA integrates information from KEGG, OMIM, PDBselect, Pfam, PubMed, SCOP, SwissProt, and others.
OPM: provides spatial positions of protein three-dimensional structures with respect to the lipid bilayer.
PDB Lite: derived from OCA; provided to make it as easy as possible to find and view a macromolecule within the PDB.
PDBe: the European resource for the collection, organisation and dissemination of data on biological macromolecular structures, and a member of the Worldwide Protein Data Bank.
PDBsum: provides an overview of macromolecular structures in the PDB, giving schematic diagrams of the molecules in each structure and of the interactions between them.
PDBTM: the Protein Data Bank of Transmembrane Proteins, a selection of the PDB.
PDBWiki: a community-annotated knowledge base of biological molecular structures.
Protein: the NIH protein database, a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
Proteopedia: the collaborative, 3D encyclopedia of proteins and other molecules. A wiki that contains a page for every entry in the PDB (>50,000 pages), with a Jmol view that highlights functional sites and ligands. It offers an easy-to-use scene-authoring tool, so you don't have to learn the Jmol script language to create customized molecular scenes. Custom scenes are easily attached to "green links" in descriptive text that display those scenes in Jmol.
SCOP: the Structural Classification of Proteins, a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.
SWISS-MODEL Repository: a database of annotated protein models calculated by homology modeling.
TOPSAN: the Open Protein Structure Annotation Network, a wiki designed to collect, share and distribute information about protein three-dimensional structures.
Retrieved from "http://en.wikipedia.org/wiki/Protein_structure_database"

Sequence analysis
Definition: The term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other bioinformatics methods on a computer.

Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand. It basically includes these topics:
The comparison of sequences in order to find similarity and dissimilarity in the compared sequences (sequence alignment).
Identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements.
Finding and comparing point mutations or single nucleotide polymorphisms (SNPs) in an organism in order to obtain genetic markers.
Revealing the evolution and genetic diversity of organisms.
Function annotation of genes.

In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing". In marketing, sequence analysis is often used in analytical customer relationship management applications, such as NPTB models (Next Product to Buy).
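As a small illustration of the point-mutation topic above, the following sketch compares two sequences that are assumed to be already aligned and of equal length, and reports every position where they differ. The function name and the two short DNA strings are hypothetical; real SNP calling works on much larger data and handles sequencing errors, which this sketch does not.

```python
def point_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned,
    so they are required to have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length (align them first)")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical toy sequences, not taken from any real organism.
reference = "ATGGCCATTG"
sample = "ATGGCTATTA"

for position, ref_base, sample_base in point_differences(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```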

Sequence Analysis in Molecular Biology
Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity. It generally falls into two types:
-Pairwise alignment: alignment between two sequences
-Multiple alignment: alignment between more than two sequences
Existing methods for pairwise alignment include the Needleman-Wunsch algorithm, the Smith-Waterman algorithm, and BLAST.
Existing methods for multiple alignment include ClustalW, PROBCONS, MUSCLE, MAFFT, DIALIGN, T-Coffee, POA, and MANGO.

Motif Finding
Motif Prediction

Methodology
The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include: Artificial Neural Network; Hidden Markov Model; Support Vector Machine; Clustering; Bayesian Network; Regression Analysis.
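To make pairwise alignment more concrete, here is a minimal sketch of the Needleman-Wunsch global alignment algorithm named above. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; production tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def substitution(x, y):
        return match if x == y else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1]),  # pair the two symbols
                score[i - 1][j] + gap,                                   # gap in b
                score[i][j - 1] + gap,                                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1]):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

With these assumed scores, aligning "ACGT" with "AGT" places a single gap opposite the C, giving three matches and one gap.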

List of Computational Chemistry Software Resources
Bioinformatics Software
Cheminformatics Software
LIMS Software
Computer-Assisted Molecular Modeling Software
CADD - Biopolymer Modeling Software
CADD - General Modeling Software
CADD - Conformational Search Software
CADD - General Tools
CADD - Molecular Mechanics/Dynamics Software
CADD - Quantum Chemistry Software
CADD - Display Software
Structural Chemistry Software
Structural Chemistry Software for X-ray Analysis
Structural Chemistry Software for IR Analysis
Structural Chemistry Software for MS Analysis
Structural Chemistry Software for NMR Analysis
General Software Tools

Lists of Software for Bioinformatics

Sequence databases, for example:
AceDB: a genome database.
BioCyc: databases that provide electronic reference sources on the pathways and genomes of different organisms.
Biopendium: brings together information on sequence, structure and function relationships for all gene products in the public domain.
CAMELEON: a set of multiple sequence alignment tools with links to databases of known 3D structural fragments.
ERGO Light: a curated database of public and proprietary genomic DNA, with connected similarities, functions, pathways, functional models, clusters and more.
Expasy: the site contains a 2-D gel database, a search engine and links to several gel databases throughout the world.
GAIA 22: a Chromosome 22 specific version of the GAIA database. GAIA is a data analysis and storage system for genomic sequence and its annotation. As a data analysis engine, it accepts raw genomic sequence and automatically adds significant annotation.
GeneCards: a database of human genes, their products and their involvement in diseases.
GENESEQ: was a database of protein and nucleic acid sequences extracted from world-wide patent documents.
GeneWorks: was an integrated sequence analysis and database searching package.
ISYS(TM): the National Center for Genome Resources' product that integrates independent bioinformatic software tools and databases.
OligoMaster: a multi-user oligonucleotide cataloguing application designed to help biologists manage and organise their oligonucleotide collections, available in versions for Windows, Macintosh and Linux.
PhyloPat: provides phylogenetic pattern analysis of eukaryotic genes.
ProteinCenter: integrates the contents of a large number of public protein sequence databases and your experimental systems biology data.
Relibase: a web-based tool for searching and analysing protein-ligand structures in the PDB.
ResNet: a comprehensive database of molecular networks and protein interactions, derived from automatic analysis of the whole of PubMed.
Rosetta Resolver System: provides high-capacity data storage, retrieval and analysis of gene expression data. The system is aimed at life science research organizations that need to assess compound specificity or toxicity, identify new genes or therapeutic targets, or compare and analyze large databases of expression profiles.
SGD: a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
SRS: a database integration and biological information search system. It is capable of querying 400 different molecular biology, bibliographic, compound data, genetic and medical databases via a single interface.
Software Solution for BioMedicine (SSBM): offers high-speed analysis of both public and proprietary genetic databases within the security of the corporate firewall.
Vector NTI: a Macintosh- and Windows-based molecular biology support system.

Pathway Analysis Tools; Structure Prediction and Analysis Tools; Sequence Analysis Tools; Sequence Management Tools; Visualization Tools

Sequence Analysis Tools: Software resources
AAT: Analysis and Annotation Tool, used to identify genes by comparing cDNA and protein sequence databases.
ABI PRISM
AcaClone pDRAW32
AGCT
AlleleID
Antheprot: protein analysis software.
Array Designer 4
arraySCOUT(TM): a gene expression data analysis application.
Artemis: a free genome viewer and annotation tool.
Asterias: a suite of freely accessible web-based genomic data analysis programs.
Bio Image: a life sciences software information company which carries a wide variety of electrophoresis image analysis software for Windows, Powermac, and UNIX.
BioinformatiX: enterprise software which provides an environment for the analysis of microarray data.
BioRainbow Analysis Tools: a collection of software tools for binding site prediction, weight matrix search, regulatory sequences analysis, microarray analysis, footprint.
bioSCOUT: a comprehensive and customizable bioinformatics package.
BioTools: offers three primary bioinformatics products: GeneTool for DNA sequence analysis, PepTool for protein sequence analysis, and ChromaTool for chromatogram analysis.
BlockSearch: a quantitative method for the elucidation of unknown protein functions.
Bosque (http://bosque.udec.cl): a distributed software environment oriented to manage the computational resources involved in typical phylogenetic analyses.
Clann: software for investigating phylogenomic information using supertrees.
CURVES: by Richard Lavery and Heinz Sklenar, a very useful nucleic acid helical analysis program.
DNADynamo: general-purpose software for DNA and protein sequence analysis.
DNASIS: a robust sequence analysis software package that delivers industry-standard functionality.
DNPTrapper: a shotgun sequencing assembly editing tool, specifically designed for finishing and analysis of repeated regions.
EuGene and SAm: a menu-based DNA and protein sequence analysis package.
Genchek: developed by Ocimum Biosolutions, a comprehensive, LIMS-based, user-friendly nucleotide and polypeptide sequence analysis tool with a back-end relational database.
Genehound: offers a new approach to identifying coding regions in prokaryotic genomes.
GeneInform: an easy-to-operate gene expression management and analysis tool that saves cost and time by facilitating the collection, storage, analysis, and sharing of gene expression data.
Gene Inspector 1.5: a powerful and versatile combination of an electronic laboratory notebook and sequence analysis package for biologists.
GeneLinker: products intended to make it easy for researchers to start analyzing gene expression data.
GeneJockey: a program for editing, manipulation, and analysis of nucleic acid and protein sequences.
GENEMARK: a gene-finding tool available from the Georgia Institute of Technology that uses an algorithm based on non-homogeneous Markov chain models.
GENEPARSER: a coding region recognition program from the University of Colorado that uses potential similarity between the query sequence and known amino acid sequences.
GeneSifter: a Web-based microarray analysis system that combines data management and analytical functions with integrated, current gene annotation from databases such as Unigene and LocusLink.
GeneSolve: a single-user desktop software package for analyzing nucleic acid sequence information.
GeneStudio Pro: from GeneStudio, Inc. (http://www.genestudio.com), a newly developed suite of molecular biology programs for Windows.
GeneWorks: an integrated sequence analysis and database searching package on the Macintosh, previously marketed by Oxford Molecular Group.
GenomeBrowser: a powerful software tool that simplifies the process of analysis, annotation, and manipulation of genetic sequences.
Genie: from LBNL, a gene finder based on generalized hidden Markov models to locate multi-exon genes.
Etc.

Relational Database Definition:

Data stored in tables that are associated by shared attributes (keys). Any data element (or entity) can be found in the database through the name of the table, the attribute name, and the value of the primary key.

Relational Database Definitions
Entity: an object, concept or event (subject)
Attribute: a characteristic of an entity
Row or Record: the specific characteristics of one entity
Table: a collection of records
Database: a collection of tables
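The statement that any data element can be found through the table name, the attribute name, and the value of the primary key can be sketched as a small lookup helper. The `protein` table, its columns, the accession values and the `lookup` function are all hypothetical names introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row (record) per protein entity.
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        ("ACC001", "Saccharomyces cerevisiae", 321),  # made-up values
        ("ACC002", "Homo sapiens", 512),
    ],
)


def lookup(connection, table, attribute, key_column, key_value):
    """Fetch one data element given table name, attribute name and primary key value."""
    # Table and column names cannot be bound as SQL parameters,
    # so they are checked against the schema before being used.
    tables = {row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError(f"unknown table: {table}")
    columns = {row[1] for row in connection.execute(f'PRAGMA table_info("{table}")')}
    if attribute not in columns or key_column not in columns:
        raise ValueError("unknown column")
    row = connection.execute(
        f'SELECT "{attribute}" FROM "{table}" WHERE "{key_column}" = ?', (key_value,)
    ).fetchone()
    return None if row is None else row[0]


print(lookup(conn, "protein", "organism", "accession", "ACC002"))
```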

Overview of Phylogenetic Analysis Phylogenetic analysis is the process you use to determine the evolutionary relationships between organisms. The results of an analysis can be drawn in a hierarchical diagram called a cladogram or phylogram (phylogenetic tree). The branches in a tree are based on the hypothesized evolutionary relationships (phylogeny) between organisms. Each member in a branch, also known as a monophyletic group, is assumed to be descended from a common ancestor. Originally, phylogenetic trees were created using morphology, but now, determining evolutionary relationships includes matching patterns in nucleic acid and protein sequences.

Example: a phylogenetic tree is constructed from mitochondrial DNA (mtDNA) sequences for the family Hominidae. This family includes gorillas, chimpanzees, orangutans, and humans.

Searching NCBI for Phylogenetic Data
The NCBI taxonomy Web site includes phylogenetic and taxonomic information from many sources. These sources include the published literature, Web databases, and taxonomy experts. While the NCBI taxonomy database is not a phylogenetic or taxonomic authority, it can be useful as a gateway to the NCBI biological sequence databases.
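One simple way to turn aligned sequences into a tree, in the spirit of the Hominidae example above, is to compute pairwise distances and then cluster with UPGMA. The sketch below is only an illustration under loud assumptions: the four short sequences are invented toy data (not real mtDNA), the distance is the plain proportion of differing sites, and UPGMA is one swapped-in method among many; the slides do not say which method was used.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Cluster taxa by UPGMA and return the tree topology as a Newick string."""
    sizes = {name: 1 for name in sequences}  # cluster label -> number of taxa in it
    dist = {
        frozenset(pair): p_distance(sequences[pair[0]], sequences[pair[1]])
        for pair in combinations(sequences, 2)
    }
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = sorted(min(dist, key=dist.get))
        merged = f"({a},{b})"
        for other in sizes:
            if other not in (a, b):
                d_a = dist.pop(frozenset((a, other)))
                d_b = dist.pop(frozenset((b, other)))
                # Size-weighted average distance to the new cluster.
                dist[frozenset((merged, other))] = (
                    sizes[a] * d_a + sizes[b] * d_b
                ) / (sizes[a] + sizes[b])
        del dist[frozenset((a, b))]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


# Invented toy sequences for illustration only; they are NOT real mtDNA data.
toy = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "gorilla": "ACGTACCTTT",
    "orangutan": "TCGAACCTTG",
}
print(upgma(toy))
```

The Newick string groups the two most similar toy sequences first and adds the others in order of increasing distance; branch lengths are omitted to keep the sketch short.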

Principles of data organization

Database: a collection of related structured information about entities
File: a collection of records
Record: a set of fields
Field: a single characteristic of an entity
Character: a symbol used in a data field

Selecting a Database Management System
Database management systems (or DBMSs) can be divided into two categories: desktop databases and server databases. Generally speaking, desktop databases are oriented toward single-user applications and reside on standard personal computers (hence the term desktop). Server databases contain mechanisms to ensure the reliability and consistency of data and are geared toward multi-user applications.

Selecting a database system: Needs Analysis
The needs analysis process will be specific to your organization but, at a minimum, should answer the following questions:
How many records will we warehouse, and for how long?
Who will be using the database and what tasks will they perform?
How often will the data be modified? Who will make these modifications?
Who will be providing IT support for the database?
What hardware is available? Is there a budget for purchasing additional hardware?
Who will be responsible for maintaining the data?
Will data access be offered over the Internet? If so, what level of access should be supported?

Some Definitions
A file: a group or collection of similar records, like the INST6031 Fall Student File, the American History 1850-1866 File, or the Basic Food Group Nutrition File.
A record book: a "rolodex" of data records, like address lists, inventory lists, classes or thematic units, or groupings of other unique records that are combined into one list (found in AppleWorks and FileMaker Pro software).
A field: one category of information, e.g., Name, Address, Semester Grade, Academic Topic.
A record: one piece of data, e.g., one student's information, a recipe, a test question.
A layout: a design for a database that contains field names and possibly graphics.

Database glossary

Fundamental building blocks
Tables comprise the fundamental building blocks of any database. If you're familiar with spreadsheets, you'll find database tables extremely similar. Take a look at this example of a table from a sample database:

The table above contains the employee information for our organization: characteristics like name, date of birth and title. Examine the construction of the table and you'll find that each column of the table corresponds to a specific employee characteristic (or attribute, in database terms). Each row corresponds to one particular employee and contains his or her information. That's all there is to it! If it helps, think of each one of these tables as a spreadsheet-style listing of information.
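The original table image is not reproduced in this text, so the following sketch builds a stand-in employee table to show the column-as-attribute and row-as-record idea. The column names follow the characteristics mentioned above (name, date of birth, title); the `employee_id` column and both rows are made-up placeholders, not data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, date_of_birth TEXT, title TEXT)"
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "A. Example", "1980-01-15", "Analyst"),
        (2, "B. Sample", "1975-06-30", "Manager"),
    ],
)

cursor = conn.execute("SELECT * FROM employee ORDER BY employee_id")
columns = [description[0] for description in cursor.description]  # attributes
rows = cursor.fetchall()                                          # one record per employee

# Print the table as a spreadsheet-style listing.
print(" | ".join(columns))
for row in rows:
    print(" | ".join(str(value) for value in row))
```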

Where do we start?
Let's explore your paper system:

Client intake forms; job application forms; funders' reports.
Define required fields from forms or required reports.
Avoid repetition.
Keep it simple.
Identify a unique identifier or primary key.

Database modeling:

Some Quality Control Considerations
Remember: garbage in, garbage out. Consider some examples and how to prevent this. Quality management encompasses three distinct processes: quality planning, quality control, and quality improvement.
Quality planning in relation to database systems design: Who will perform data entry? What training is needed? Will there be on-line help? How will data entry be performed?

Data entry considerations
Define "must enter" fields: no record is complete unless such and such is entered.
Make data entry foolproof. Example: grade level can be entered in several ways (8, 8th, or eight). By using a pull-down menu with the correct data format, these mistakes can be avoided.
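When a pull-down menu is not available, the same two rules (required fields and one canonical format) can be enforced in code. This is a minimal sketch under assumptions: the field names `name` and `grade_level`, the grade range 1 to 12, and both function names are hypothetical.

```python
import re

# Assumed canonical range for grade levels; adjust to the real system.
GRADE_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
REQUIRED_FIELDS = ("name", "grade_level")  # hypothetical "must enter" fields


def normalize_grade(raw):
    """Convert entries such as '8', '8th' or 'eight' to the integer 8."""
    text = str(raw).strip().lower()
    numeric = re.fullmatch(r"(\d{1,2})(st|nd|rd|th)?", text)
    if numeric:
        grade = int(numeric.group(1))
    elif text in GRADE_WORDS:
        grade = GRADE_WORDS[text]
    else:
        raise ValueError(f"unrecognized grade level: {raw!r}")
    if not 1 <= grade <= 12:
        raise ValueError(f"grade level out of range: {grade}")
    return grade


def missing_fields(record):
    """List the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


print([normalize_grade(entry) for entry in ("8", "8th", "eight")])
print(missing_fields({"name": "A. Student"}))
```

Storing the normalized integer rather than the free text keeps later sorting and reporting consistent.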

Data entry: additional considerations
Barcode scanners: USB or wireless, attached to a Palm or Pocket PC.
Wireless connectivity: WiFi 802.11g, Bluetooth.
Wireless networks (real-time, on-demand systems).

PEOPLE THAT WORK WITH DATABASES
System Analysts; Database Designers; Application Developers; Database Administrators; End Users

System Analysts
communicate with each prospective database user group in order to understand its information needs and processing needs;
develop a specification of each user group's information and processing needs;
develop a specification integrating the information and processing needs of the user groups;
document the specification.

Database Designers
choose appropriate structures to represent the information specified by the system analysts;
choose appropriate structures to store the information in a normalized manner in order to guarantee integrity and consistency of data;
choose appropriate structures to guarantee an efficient system;
document the database design.

Application Developers
implement the database design;
implement the application programs to meet the program specifications;
test and debug the database implementation and the application programs;
document the database implementation and the application programs.

Database Administrators
Manage the database structure.
Manage data activity.
Manage the database management system: generate database application performance reports; investigate user performance complaints; assess the need for changes in database structure or application design; modify the database structure; evaluate and implement new DBMS features; tune the database.
Establish the database data dictionary: data names, formats, relationships; cross-references between data and application programs.
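A very small data dictionary (data names and formats per table) can be read straight from the database's own catalog. The sketch below uses SQLite's `sqlite_master` table and `PRAGMA table_info`; the `client` and `visit` tables are hypothetical, and a real data dictionary would also record relationships and cross-references to application programs, which this sketch leaves out.

```python
import sqlite3


def data_dictionary(connection):
    """Map each table name to a list of (column name, declared type) pairs."""
    dictionary = {}
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # table_info rows are: cid, name, type, notnull, dflt_value, pk
        info = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        dictionary[table] = [(row[1], row[2]) for row in info]
    return dictionary


conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration.
conn.executescript("""
    CREATE TABLE client (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        client_id  INTEGER REFERENCES client(client_id),
        visit_date TEXT
    );
""")

for table, columns in data_dictionary(conn).items():
    print(table, columns)
```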

End Users
Parametric end users constantly query and update the database. They use canned transactions to support standard queries and updates.
Casual end users occasionally access the database, but may need different information each time. They use sophisticated query languages and browsers.
Sophisticated end users have complex requirements and need different information each time. They are thoroughly familiar with the capabilities of the DBMS.