Ad Bms Notes


  • 8/10/2019 Ad Bms Notes

    1/44

    UNIT-1

Query processing is the process by which a declarative query is translated into low-level data manipulation operations. SQL is the standard query language that is supported in current DBMSs.

    Query Processing steps:

Parsing and Translation

o Translate the query into its internal form (parse tree).
o This is then translated into an expression of the relational algebra.
o The parser checks syntax and validates relations, attributes, and access permissions.

    Evaluation

o The query execution engine takes a physical query plan (also called an execution plan), executes the plan, and returns the result.

Optimization: Find the "cheapest" execution plan for a query.

    A relational algebra expression may have many equivalent expressions, e.g.,

π_CName(σ_Price>5000((CUSTOMERS ⋈ ORDERS) ⋈ OFFERS)) = π_CName((CUSTOMERS ⋈ ORDERS) ⋈ σ_Price>5000(OFFERS))

    Representation as logical query plan (a tree):


    Non-leaf nodes = operations of relational algebra (with parameters); Leaf nodes = relations

A relational algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called the execution plan (it includes, e.g., whether an index is used, which join algorithms are chosen, and so on).

Among all semantically equivalent expressions, the one with the least costly evaluation plan is chosen. The cost estimate of a plan is based on statistical information in the system catalogs.

Query optimization refers to the process by which the best execution strategy for a given query is found from among a set of alternatives.

    The process typically involves two steps:

Query Decomposition: Query decomposition takes an SQL query and translates it into relational algebra. In the process, the query is analyzed semantically so that incorrect queries are detected and rejected as easily as possible, and correct queries are simplified. Simplification involves the elimination of redundant predicates, which may be introduced as a result of query modification to deal with views, security enforcement, and semantic integrity control. The simplified query is then restructured as an algebraic query.

Query Optimization: For a given SQL query, there is more than one possible relational algebra expression. Some of these algebraic expressions are better than others. The quality of an algebraic expression is defined in terms of expected performance.

The traditional procedure is to obtain an initial algebraic expression by translating the predicates and the target statement into relational operations as they appear in the query. This initial algebraic query is then transformed, using algebraic transformation rules, into other algebraic queries until the best one is found. The best algebraic expression is determined according to a cost function, which calculates the cost of executing the query according to that algebraic specification. This is the process of query optimization.


Query optimization typically takes one of two forms: Heuristic (sometimes called rule-based) and Systematic (cost-based).

    Heuristic Optimization

    Cost Based Optimization

In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering the individual operations.

With Cost Based Optimization, the overall cost of executing the query is systematically reduced by estimating the costs of executing several different execution plans.

    Heuristic Query Optimization

In this method, relational algebra expressions are rewritten as equivalent expressions that take much less time and fewer resources to process. As we illustrate, repositioning relational algebra operations in certain ways does not affect the results. First we present an example to show the effect of this repositioning, and then present a list of heuristic rules for optimizing relational algebra expressions. Once an expression is optimized, it can be implemented efficiently.

A query can be represented as a tree data structure. Operations are at the interior nodes and data items (tables, columns) are at the leaves.

The query is evaluated in a depth-first pattern.

    For Example:

SELECT PNUMBER, DNUM, LNAME
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford';

Or, in relational algebra:

π_PNUMBER,DNUM,LNAME(σ_DNUM=DNUMBER ∧ MGRSSN=SSN ∧ PLOCATION='Stafford'(PROJECT × DEPARTMENT × EMPLOYEE))

on the following schema:

EMPLOYEE TABLE:
FNAME    MI LNAME   SSN       BDATE     ADDRESS                  S SALARY SUPERSSN  DNO
-------- -- ------- --------- --------- ------------------------ - ------ --------- ---
JOHN     B  SMITH   123456789 09-JAN-55 731 FONDREN, HOUSTON, TX M 30000  333445555 5
FRANKLIN T  WONG    333445555 08-DEC-45 638 VOSS, HOUSTON, TX    M 40000  888665555 5
ALICIA   J  ZELAYA  999887777 19-JUL-58 3321 CASTLE, SPRING, TX  F 25000  987654321 4
JENNIFER S  WALLACE 987654321 20-JUN-31 291 BERRY, BELLAIRE, TX  F 43000  888665555 4
RAMESH   K  NARAYAN 666884444 15-SEP-52 975 FIRE OAK, HUMBLE, TX M 38000  333445555 5
JOYCE    A  ENGLISH 453453453 31-JUL-62 5631 RICE, HOUSTON, TX   F 25000  333445555 5


AHMAD    V  JABBAR  987987987 29-MAR-59 980 DALLAS, HOUSTON, TX  M 25000  987654321 4
JAMES    E  BORG    888665555 10-NOV-27 450 STONE, HOUSTON, TX   M 55000  null      1

DEPARTMENT TABLE:
DNAME          DNUMBER MGRSSN    MGRSTARTD
-------------- ------- --------- ---------
HEADQUARTERS   1       888665555 19-JUN-71
ADMINISTRATION 4       987654321 01-JAN-85
RESEARCH       5       333445555 22-MAY-78

PROJECT TABLE:
PNAME           PNUMBER PLOCATION DNUM
--------------- ------- --------- ----
ProductX        1       Bellaire  5
ProductY        2       Sugarland 5
ProductZ        3       Houston   5
Computerization 10      Stafford  4
Reorganization  20      Houston   1
NewBenefits     30      Stafford  4

WORKS_ON TABLE:
ESSN      PNO HOURS
--------- --- -----
123456789 1   32.5
123456789 2   7.5
666884444 3   40.0
453453453 1   20.0
453453453 2   20.0
333445555 2   10.0
333445555 3   10.0
333445555 10  10.0
333445555 20  10.0
999887777 30  30.0
999887777 10  10.0
987987987 10  35.0
987987987 30  5.0
987654321 30  20.0
987654321 20  15.0
888665555 20  null

Which of the following query trees is more efficient?

    The left hand tree is evaluated in steps as follows:


    The right hand tree is evaluated in steps as follows:

Note the two cross product operations. These require lots of space and time (nested loops) to build.

After the two cross products, we have a temporary table with 144 records (6 projects × 3 departments × 8 employees).

An overall rule for heuristic query optimization is to perform as many select and project operations as possible before doing any joins. There are a number of transformation rules that can be used to transform a query:

1. Cascading selections. A list of conjunctive conditions can be broken up into separate individual conditions.


σ_c1(σ_c2(E)) = σ_c1 ∧ c2(E)

2. Commutativity of the selection operation.

3. Cascading projections. All but the last projection can be ignored. Assume that attributes A1, . . . , An are among B1, . . . , Bm. Then

π_A1,...,An(π_B1,...,Bm(E)) = π_A1,...,An(E)

4. Commuting selection and projection. If a selection condition only involves attributes contained in the projection clause, the two can be commuted.

5. Commutativity of Join and Cross Product.

6. Commuting selection with Join. If c only involves attributes from E1, then

σ_c(E1 ⋈ E2) = σ_c(E1) ⋈ E2

7. Commuting projection with Join.

8. Commutativity of set operations. Union and Intersection are commutative.

9. Associativity of Union, Intersection, Join and Cross Product.

10. Commuting selection with set operations. For θ ∈ {∪, ∩, −}:

σ_c(E1 θ E2) = σ_c(E1) θ σ_c(E2)

11. Commuting projection with set operations.

π_A1,...,An(E1 ∪ E2) = π_A1,...,An(E1) ∪ π_A1,...,An(E2)

12. Logical transformation of selection conditions, for example using DeMorgan's laws.

13. Combine Selection and Cartesian product to form Joins.
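The rules above can be sanity-checked on toy data. The sketch below (a minimal illustration; the relation contents and attribute names are ours, not from the notes) models relations as lists of dicts and verifies rule 1 (cascading selections) and rule 6 (pushing a selection through a join when the condition touches only one side):

```python
# Relations as lists of dicts; select/join as plain comprehensions.
EMP = [{"ssn": 1, "dno": 5}, {"ssn": 2, "dno": 4}]
DEPT = [{"dnumber": 5, "mgr": 9}, {"dnumber": 4, "mgr": 8}]

def select(rel, pred):
    return [t for t in rel if pred(t)]

def join(r1, r2, pred):
    return [{**t1, **t2} for t1 in r1 for t2 in r2 if pred(t1, t2)]

# Rule 1: sigma_c1(sigma_c2(E)) == sigma_{c1 AND c2}(E)
cascade = select(select(EMP, lambda t: t["dno"] == 5), lambda t: t["ssn"] == 1)
combined = select(EMP, lambda t: t["dno"] == 5 and t["ssn"] == 1)
assert cascade == combined

# Rule 6: sigma_c(E1 join E2) == sigma_c(E1) join E2 when c touches only E1
late = select(join(EMP, DEPT, lambda a, b: a["dno"] == b["dnumber"]),
              lambda t: t["dno"] == 5)
early = join(select(EMP, lambda t: t["dno"] == 5), DEPT,
             lambda a, b: a["dno"] == b["dnumber"])
assert late == early
```

Note that pushing the selection below the join (the `early` plan) builds a smaller intermediate result, which is exactly the motivation for the heuristic of selecting before joining.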

    Systematic (Cost based) Query Optimization

Just looking at the syntax of the query may not give the whole picture; we need to look at the data as well.

Several cost components to consider:
1. Access cost to secondary storage (hard disk)
2. Storage cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc. for performing in-memory operations
4. Communication costs to ship data around a network, e.g., in a distributed or client/server database


Of these, access cost is the most crucial in a centralized DBMS. The more work we can do with data in cache or in memory, the better.

Access routines are algorithms that are used to access and aggregate data in a database. An RDBMS may have a collection of general-purpose access routines that can be combined to implement a query execution plan.

We are interested in access routines for selection, projection, join, and set operations such as union, intersection, set difference, Cartesian product, etc.

As with heuristic optimization, there can be many different plans that lead to the same result. In general, if a query contains n operations, there will be n! possible plans.

However, not all plans will make sense. We should consider:

Perform all simple selections first

Perform joins next

Perform projections last

Overview of the cost-based optimization process:

1. Enumerate all of the legitimate plans (call these P1...Pn), where each plan contains a set of operations O1...Ok.
2. Select a plan.
3. For each operation Oi in the plan, enumerate the access routines.
4. For each possible access routine for Oi, estimate the cost. Select the access routine with the lowest cost.
5. Repeat the previous two steps until an efficient access routine has been selected for each operation. Sum up the costs of each access routine to determine a total cost for the plan.
6. Repeat steps 2 through 5 for each plan and choose the plan with the lowest total cost.
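The steps above can be sketched as a small Python exercise. All plan names, routine names, and cost numbers below are made up for illustration; each plan maps every operation to candidate access routines with estimated costs, we pick the cheapest routine per operation, and then the cheapest plan overall:

```python
# Toy version of the cost-based optimization loop (hypothetical numbers).
plans = {
    "P1": [("select", {"full_scan": 100, "index_scan": 10}),
           ("join",   {"nested_loop": 500, "hash_join": 120})],
    "P2": [("join",   {"nested_loop": 900, "hash_join": 300}),
           ("select", {"full_scan": 100})],
}

def plan_cost(ops):
    # Steps 3-5: cheapest access routine per operation, summed over the plan.
    return sum(min(routines.values()) for _, routines in ops)

costs = {name: plan_cost(ops) for name, ops in plans.items()}
best = min(costs, key=costs.get)   # step 6: plan with the lowest total cost
```

Here P1 costs 10 + 120 = 130 and P2 costs 300 + 100 = 400, so P1 is chosen.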

    Catalog Information for Cost Estimation

Information about relations and attributes:

NR: number of tuples in the relation R.

BR: number of blocks that contain tuples of the relation R.

SR: size of a tuple of R.

FR: blocking factor; number of tuples from R that fit into one block (FR = ⌈NR/BR⌉).

V(A,R): number of distinct values for attribute A in R.

SC(A,R): selectivity of attribute A = average number of tuples of R that satisfy an equality condition on A. SC(A,R) = NR/V(A,R).

Information about indexes:

HTI: number of levels in index I (B+-tree).

LBI: number of blocks occupied by leaf nodes in index I (first-level blocks).

ValI: number of distinct values for the search key.
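A quick sketch of the catalog formulas above, computed on a toy relation (the attribute and values are invented for illustration):

```python
# N_R, V(A,R), and SC(A,R) = N_R / V(A,R) on a toy relation.
R = [{"dno": 5}, {"dno": 5}, {"dno": 4}, {"dno": 5}, {"dno": 1}, {"dno": 4}]

N_R = len(R)                        # number of tuples in R
V = len({t["dno"] for t in R})      # distinct values of attribute dno
SC = N_R / V                        # avg tuples matching dno = constant
```

With 6 tuples and 3 distinct dno values, SC(dno, R) = 6/3 = 2.0: an equality condition on dno is estimated to return two tuples on average.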

    Measures of Query Cost


    There are many possible ways to estimate cost, e.g., based on disk accesses, CPUtime, or communication overhead.

Disk access is the predominant cost (in terms of time) and is relatively easy to estimate; therefore, the number of block transfers from/to disk is typically used as the measure. Simplifying assumption: each block transfer has the same cost.

The cost of an algorithm (e.g., for join or selection) depends on the database buffer size; more memory for the DB buffer reduces disk accesses. Thus DB buffer size is a parameter for estimating cost.

We refer to the cost estimate of algorithm S as cost(S). We do not consider the cost of writing output to disk.

    Relational Algebra Equivalences:

Equivalence Rules (for expressions E, E1, E2 and conditions Fi), applying distributivity and commutativity of relational algebra operations:

1. σ_F1(σ_F2(E)) = σ_F1 ∧ F2(E)

2. σ_F(E1 θ E2) = σ_F(E1) θ σ_F(E2), for θ ∈ {∪, ∩, −}

3. σ_F(E1 × E2) = σ_F0(σ_F1(E1) × σ_F2(E2)), where F = F0 ∧ F1 ∧ F2 and Fi contains only attributes of Ei, i = 1, 2.

4. σ_A=B(E1 × E2) = E1 ⋈_A=B E2

5. π_A(E1 ∪ E2) = π_A(E1) ∪ π_A(E2) (this holds for union; it does not hold in general for ∩ or −)

6. π_A(E1 × E2) = π_A1(E1) × π_A2(E2), with Ai = A ∩ {attributes in Ei}, i = 1, 2.

7. E1 θ E2 = E2 θ E1, for θ ∈ {∪, ∩}

(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3) (the analogous holds for ∩)

8. E1 × E2 = π_A1,A2(E2 × E1)

(E1 × E2) × E3 = E1 × (E2 × E3)

(E1 × E2) × E3 = (E1 × E3) × E2

9. E1 ⋈ E2 = E2 ⋈ E1

(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
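Rule 2 (selection distributes over set operations) is easy to verify on toy data; the tuples and predicate below are ours, chosen only to exercise the rule:

```python
# Check sigma_F(E1 ∪ E2) == sigma_F(E1) ∪ sigma_F(E2) on small sets of tuples.
E1 = {(1, "a"), (2, "b"), (3, "c")}
E2 = {(2, "b"), (4, "d")}
F = lambda t: t[0] % 2 == 0          # condition: keep tuples with an even key

lhs = {t for t in (E1 | E2) if F(t)}                       # select after union
rhs = {t for t in E1 if F(t)} | {t for t in E2 if F(t)}    # select before union
assert lhs == rhs
```

Both orders yield {(2, "b"), (4, "d")}, as the rule promises.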


    UNIT-2

    Disadvantages of RDBMS

RDBMSs are not suitable for applications with complex data structures or new data types for large, unstructured objects, such as CAD/CAM, geographic information systems, multimedia databases, imaging, and graphics.

RDBMSs typically do not allow users to extend the type system by adding new data types.

They also support only first-normal-form relations, in which the type of every column must be atomic, i.e., no sets, lists, or tables are allowed inside a column.

    Recursive queries are difficult to write.

    MOTIVATING EXAMPLE

As a specific example of the need for object-relational systems, we focus on a new business data processing problem that is both harder and (in our view) more entertaining than the dollars-and-cents bookkeeping of previous decades. Today, companies in industries such as entertainment are in the business of selling bits; their basic corporate assets are not tangible products, but rather software artifacts such as video and audio.

We consider the fictional Dinky Entertainment Company, a large Hollywood conglomerate whose main assets are a collection of cartoon characters, especially the cuddly and internationally beloved Herbert the Worm. Dinky has a number of Herbert the Worm films, many of which are being shown in theaters around the world at any given time. Dinky also makes a good deal of money licensing Herbert's image, voice, and video footage for various purposes: action figures, video games, product endorsements, and so on. Dinky's database is used to manage the sales and leasing records for the various Herbert-related products, as well as the video and audio data that make up Herbert's many films.

Traditional database systems, such as RDBMSs, have been quite successful in developing the database technology required for many traditional business database applications. However, they have certain shortcomings when more complex database applications must be designed and implemented, for example, databases for engineering design and manufacturing (CAD/CAM), scientific experiments, telecommunications, geographic information systems, and multimedia. These newer applications have requirements and characteristics that differ from those of traditional business applications, such as more complex structures for objects, longer-duration transactions, new data types for storing images or large textual items, and the need to define nonstandard application-specific operations.

Object-oriented databases were proposed to meet the needs of these more complex applications. The object-oriented approach offers the flexibility to handle some of these requirements without being limited by the data types and query languages available in traditional database systems. A key feature of object-oriented databases is the power they give the designer to specify both the structure of complex objects and the operations that can be applied to these objects.


Object database systems combine the classical capabilities of relational database management systems (RDBMSs) with new functionality arising from object orientation. The traditional capabilities include:

    Secondary storage management

    Schema management

    Concurrency control

Transaction management and recovery

Query processing

    Access authorization and control, safety, security

    New capabilities of object databases include:

    Complex objects

    Object identities

    User-defined types

Encapsulation

Type/class hierarchy with inheritance

    Overloading, overriding, late binding, polymorphism

    Mandatory features of object-oriented systems

    Support for complex objects


A complex object mechanism allows an object to contain attributes that can themselves be objects. In other words, the schema of an object is not in first normal form. Examples of attributes that can comprise a complex object include lists, bags, and embedded objects.

    Object identity

Every instance in the database has a unique identifier (OID), which is a property of an object that distinguishes it from all other objects and remains for the lifetime of the object. In object-oriented systems, an object has an existence (identity) independent of its value.

Each database object has identity, i.e., a unique internal identifier (OID), with no meaning in the problem domain. Each object also has one or more external names that the programmer can use to identify the object.

    Properties of OID:

    It is unique

    It is system generated

It is invisible to the user; that is, it cannot be modified by the user.

It is immutable. That is, once generated, it is never regenerated.

It is a long integer value.

Encapsulation

Object-oriented models enforce encapsulation and information hiding. This means the state of objects can be manipulated and read only by invoking operations that are specified within the type definition and made visible through the public clause. In an object-oriented database system, encapsulation is achieved if only the operations are visible to the programmer, and both the data and the implementation are hidden.
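A minimal sketch of this idea in a general-purpose language (the class and attribute names are ours): the state can only be read or changed through the public operations, and the underscored attribute marks the hidden implementation:

```python
# Encapsulation: state is reachable only via the public methods.
class Account:
    def __init__(self, opening):
        self._balance = opening      # hidden state (underscore = private by convention)

    def deposit(self, amount):       # public operation that manipulates the state
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

    def balance(self):               # public operation that reads the state
        return self._balance

a = Account(100)
a.deposit(50)                        # a.balance() is now 150
```

Callers never touch `_balance` directly, so the implementation (say, switching to integer cents) can change without affecting them.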

    Support for types or classes

Type: in an object-oriented system, a type summarizes the common features of a set of objects with the same characteristics. In programming languages, types can be used at compilation time to check the correctness of programs.

Class: the concept is similar to type but associated with run-time execution. The term class refers to a collection of all objects with the same internal structure (attributes) and methods. These objects are called instances of the class.

Both of these features can be used to group similar objects together, but it is normal for a system to support either classes or types, and not both.

Class or type hierarchies

Any subclass or subtype will inherit attributes and methods from its superclass or supertype.

    Overriding, Overloading and Late Binding

Overloading: a class modifies an existing method by using the same name but with a different list, or type, of parameters.

Overriding: the implementation of the operation depends on the type of the object it is applied to.

    Late binding: The implementation code cannot be referenced until run-time.
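These three ideas can be illustrated with a short sketch (the class names are ours): each subclass overrides `area()`, and late binding means the implementation that runs is chosen at run time from the object's actual class:

```python
# Overriding + late binding: the same message invokes different code.
class Shape:
    def area(self):
        raise NotImplementedError    # to be overridden in subclasses

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):                  # overrides Shape.area
        return self.side ** 2

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):                  # a different override of the same operation
        return 3.14159 * self.r ** 2

shapes = [Square(3), Circle(1)]
areas = [s.area() for s in shapes]   # late binding picks the method per object
```

The same `area()` message sent to different objects invokes different operations, which is exactly the polymorphism described above.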

    Computational Completeness


SQL does not have the full power of a conventional programming language. Languages such as Pascal or C are said to be computationally complete because they can exploit the full capabilities of a computer. SQL is only relationally complete, that is, it has the full power of relational algebra. Whilst any SQL code could be rewritten as a C++ program, not all C++ programs could be rewritten in SQL.

    Mandatory features of database systems

A database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. Thus, a database system provides the following five features:

Persistence

As in a conventional database, data must remain after the process that created it has terminated. For this purpose, data has to be stored permanently on secondary storage.

Secondary Storage Management

Traditional databases employ techniques that manage secondary storage in order to improve the performance of the system. These are usually invisible to the user of the system.

Concurrency

The system should provide a concurrency mechanism similar to the concurrency mechanisms in conventional databases.

Recovery

The system should provide a recovery mechanism similar to the recovery mechanisms in conventional databases.

Ad hoc query facility

The database should provide a high-level, efficient, application-independent query facility. This need not necessarily be a query language; it could instead be some type of graphical interface.

    Structured Data types:

A structured data type is a form of user-defined data type that contains a sequence of attributes, each of which has a data type. An attribute is a property that helps describe an instance of the type. For example, if we were to define a structured type called address_t, city might be one of the attributes of this structured type. Structured types make it easy to use data, such as an address, either as a single unit or as separate data items, without having to store each of those items (or attributes) in a separate column.

A structured data type can be used as the type for a column in a regular table, the type for an entire table (or view), or as an attribute of another structured type. When used as the type for a table, the table is known as a typed table.

Structured data types exhibit a behavior known as inheritance. A structured type can have subtypes: other structured types that reuse all of its attributes and contain their own specific attributes. The type from which a subtype inherits attributes is known as its supertype.

    For Example:

We have to create a table employee with the following structure:


    Name Age Salary Address

    FName LName street City province Postal_code

create type address_t as (street varchar(12), city varchar(12), province varchar(12), postal_code char(6));

create type Name_t as (FName varchar(12), LName varchar(20));

Next, create a new structured type whose attributes use these two structured types:

    create type employee_t as(emp_id integer, ename Name_t, address address_t);

Now we can create a table of the above structured type:

create table employee of employee_t REF is emp_id system generated;

We can also declare an array type to define multivalued attributes.

    For Example:

create type phone_t as (phoneno char(10) array[3]);

Here the user can save three phone numbers for an employee.

Complex objects, object identity. The database should consist of objects having arbitrary complexity and an arbitrary number of hierarchy levels. Objects can be aggregates of (sub-)objects.

An object typically has two components: state (value) and behavior (operations). Hence, it is somewhat similar to a program variable in a programming language, except that it will typically have a complex data structure as well as specific operations defined by the programmer.

    Types of objects:

Transient objects: objects in an OOPL exist only during program execution and are hence called transient objects.

Persistent objects: an OO database can extend the existence of objects so that they are stored permanently; the objects persist beyond program termination and can be retrieved later and shared by other programs. In other words, OO databases store persistent objects permanently on secondary storage and allow the sharing of these objects among multiple programs and applications. This requires the incorporation of other well-known features of database management systems, such as indexing mechanisms, concurrency control, and recovery. An OO database system interfaces with one or more OO programming languages to provide persistent and shared object capabilities.

Relationships, associations, links. Objects are connected by conceptual links. For instance, the Employee and Department objects can be connected by a link worksFor. In the data structure, links are implemented as logical pointers (bi-directional or uni-directional).


Encapsulation and information hiding. The internal properties of an object are subdivided into two parts: public (visible from the outside) and private (invisible from the outside). The user of an object can refer to public properties only.

Classes, types, interfaces. Each object is an instance of one or more classes. The class is understood as a blueprint for objects; i.e., objects are instantiated according to information presented in the class, and the class contains the properties that are common to some collection of objects (object invariants). Each object is assigned a type. Objects are accessible through their interfaces, which specify all the information that is necessary for using objects.

Abstract data types (ADTs): a kind of class which assumes that any access to an object is limited to the predefined collection of operations.

Operations, methods and messages. An object is associated with a set of operations (called methods). The object performs an operation after receiving a message with the name of the operation to be performed (and the parameters of this operation).

Inheritance. Classes are organized in a hierarchy reflecting the hierarchy of real-world concepts. For instance, the class Person is a superclass of the classes Employee and Student. Properties of more abstract classes are inherited by more specific classes. Multiple inheritance means that a specific class inherits from several independent classes.

Polymorphism, late binding, overriding. The operation to be executed on an object is chosen dynamically, after the object receives the message with the operation name. The same message sent to different objects can invoke different operations.

Persistence. Database objects are persistent, i.e., they live as long as necessary. They can outlive the programs that created them.

    Object Database Management Group (ODMG).

A special interest group formed to develop standards that allow ODBMS customers to write portable applications.

    Standards include:

    Object Model

    Object Specification Languages

    Object Definition Language (ODL) for schema definition

    Object Interchange Format (OIF) to exchange objects between databases

    Object Query Language

    declarative language to query and update database objects

    Language Bindings (C++, Java, Smalltalk)

    Object manipulation language

Mechanisms to invoke OQL from the language

Procedures for operations on databases and transactions

    CHALLENGES IN IMPLEMENTING AN ORDBMS

The enhanced functionality of ORDBMSs raises several implementation challenges. Some of these are well understood, and solutions have been implemented in products; others are the subject of current research.


(e.g., the R-tree, which matches conditions such as "Find me all theaters within 100 miles of Andorra").

One way to make the set of index structures extensible is to publish an access method interface that lets users implement an index structure outside of the DBMS. The index and data can be stored in a file system, and the DBMS simply issues the open, next, and close iterator requests to the user's external index code. Such functionality makes it possible for a user to connect a DBMS to a Web search engine, for example. A main drawback of this approach is that data in an external index is not protected by the DBMS's support for concurrency and recovery.

An alternative is for the ORDBMS to provide a generic template index structure that is sufficiently general to encompass most index structures that users might invent. Because such a structure is implemented within the DBMS, it can support high concurrency and recovery. The Generalized Search Tree (GiST) is such a structure. It is a template index structure based on B+ trees, which allows most of the tree index structures invented so far to be implemented with only a few lines of user-defined ADT code.

    Query Processing

ADTs and structured types call for new functionality in processing queries in ORDBMSs. They also change a number of assumptions that affect the efficiency of queries. In this section we look at two functionality issues (user-defined aggregates and security) and two efficiency issues (method caching and pointer swizzling).

    User-Defined Aggregation Functions

Since users are allowed to define new methods for their ADTs, it is not unreasonable to expect them to want to define new aggregation functions for their ADTs as well. For example, the usual SQL aggregates (COUNT, SUM, MIN, MAX, AVG) are not particularly appropriate for the Image type schema.

Most ORDBMSs allow users to register new aggregation functions with the system. To register an aggregation function, a user must implement three methods, which we will call initialize, iterate, and terminate. The initialize method initializes the internal state for the aggregation. The iterate method updates that state for every tuple seen, while the terminate method computes the aggregation result based on the final state and then cleans up. As an example, consider an aggregation function to compute the second-highest value in a field. The initialize call would allocate storage for the top two values, the iterate call would compare the current tuple's value with the top two and update the top two as necessary, and the terminate call would delete the storage for the top two values, returning a copy of the second-highest value.
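The initialize/iterate/terminate protocol described above can be sketched as follows (the class and method names mirror the description but are our own; real ORDBMSs register such functions through vendor-specific syntax):

```python
# Sketch of a user-defined aggregate: second-highest value in a field.
class SecondHighest:
    def initialize(self):
        self.top2 = []                 # allocate storage for the two largest values

    def iterate(self, value):
        # keep only the two largest values seen so far
        self.top2 = sorted(self.top2 + [value], reverse=True)[:2]

    def terminate(self):
        result = self.top2[1] if len(self.top2) == 2 else None
        self.top2 = []                 # clean up the aggregation state
        return result

agg = SecondHighest()
agg.initialize()
for salary in [30000, 55000, 40000, 25000]:   # one iterate call per tuple
    agg.iterate(salary)
second = agg.terminate()
```

On the sample salaries the highest is 55000, so terminate returns 40000.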

    Method Security

ADTs give users the power to add code to the DBMS; this power can be abused. A buggy or malicious ADT method can bring down the database server or even corrupt the database. The DBMS must have mechanisms to prevent buggy or malicious user code from causing problems.

    It may make sense to override these mechanisms for efficiency in production environments with


vendor-supplied methods. However, it is important for the mechanisms to exist, if only to support debugging of ADT methods; otherwise method writers would have to write bug-free code before registering their methods with the DBMS, not a very forgiving programming environment.

One mechanism to prevent problems is to have the user methods be interpreted rather than compiled. The DBMS can check that the method is well behaved either by restricting the power of the interpreted language or by ensuring that each step taken by a method is safe before executing it. Typical interpreted languages for this purpose include Java and the procedural portions of SQL:1999.

An alternative mechanism is to allow user methods to be compiled from a general-purpose programming language such as C++, but to run those methods in a different address space than the DBMS. In this case the DBMS sends explicit interprocess communications (IPCs) to the user method, which sends IPCs back in return. This approach prevents bugs in the user methods (e.g., stray pointers) from corrupting the state of the DBMS or database, and prevents malicious methods from reading or modifying the DBMS state or database as well. Note that the user writing the method need not know that the DBMS is running the method in a separate process: the user code can be linked with a wrapper that turns method invocations and return values into IPCs.

    Method Caching

User-defined ADT methods can be very expensive to execute and can account for the bulk of the time spent in processing a query. During query processing it may make sense to cache the results of methods, in case they are invoked multiple times with the same argument. Within the scope of a single query, one can avoid calling a method twice on duplicate values in a column by either sorting the table on that column or using a hash-based scheme much like that used for aggregation. An alternative is to maintain a cache of method inputs and matching outputs as a table in the database. Then, to find the value of a method on particular inputs, we essentially join the input tuples with the cache table. These two approaches can also be combined.

    Pointer Swizzling

In some applications, objects are retrieved into memory and accessed frequently through their oids, so dereferencing must be implemented very efficiently. Some systems maintain a table of the oids of objects that are currently in memory. When an object O is brought into memory, they check each oid contained in O and replace oids of in-memory objects with in-memory pointers to those objects. This technique is called pointer swizzling and makes references to in-memory objects very fast. The downside is that when an object is paged out, in-memory references to it must somehow be invalidated and replaced with its oid.
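A minimal illustration of swizzling and unswizzling, with invented names (`Obj`, `load`, `page_out`) rather than any real DBMS interface:

```python
# Objects reference each other by oid strings on disk; when an object is
# brought into memory, oid references to resident objects are replaced
# with direct pointers (swizzling), and restored on eviction.

class Obj:
    def __init__(self, oid, refs):
        self.oid = oid
        self.refs = refs           # list of oids or in-memory Obj pointers

resident = {}                      # table of oids of objects currently in memory

def load(oid, store):
    obj = Obj(oid, list(store[oid]))
    # Swizzle: replace oids of in-memory objects with direct references.
    obj.refs = [resident.get(r, r) for r in obj.refs]
    resident[oid] = obj
    # Also swizzle references to this object held by resident objects.
    for other in resident.values():
        other.refs = [obj if r == oid else r for r in other.refs]
    return obj

def page_out(oid):
    # Unswizzle: in-memory references to the evicted object revert to its oid.
    obj = resident.pop(oid)
    for other in resident.values():
        other.refs = [oid if r is obj else r for r in other.refs]

store = {"o1": ["o2"], "o2": []}   # on-disk representation: refs are oids
o2 = load("o2", store)
o1 = load("o1", store)
swizzled = o1.refs[0] is o2        # direct in-memory pointer
page_out("o2")
unswizzled = o1.refs[0] == "o2"    # back to an oid after eviction
```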

    Query Optimization

New indexes and query processing techniques widen the choices available to a query optimizer. To handle the new query processing functionality, an optimizer must know about that functionality and use it appropriately. In this section we discuss two issues in exposing information to the optimizer (new indexes and ADT method estimation) and an issue in query planning that was ignored in relational systems (expensive selection optimization).


    Registering Indexes with the Optimizer

As new index structures are added to a system, either via external interfaces or via built-in template structures like GiSTs, the optimizer must be informed of their existence and their costs of access. In particular, for a given index structure the optimizer must know (a) what WHERE-clause conditions are matched by that index, and (b) what the cost of fetching a tuple is for that index. Given this information, the optimizer can use any index structure in constructing a query plan. Different ORDBMSs vary in the syntax for registering new index structures. Most systems require users to state a number representing the cost of access, but an alternative is for the DBMS to measure the structure as it is used and maintain running statistics on cost.
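The registration idea can be sketched as follows; the `register_index` and `choose_index` functions and all index names are purely hypothetical, since each ORDBMS has its own registration syntax:

```python
# Catalog of registered index structures, each declaring the WHERE-clause
# operators it matches and a per-tuple fetch cost. The optimizer picks the
# cheapest registered index that matches the condition.

registered_indexes = []

def register_index(name, matched_ops, fetch_cost):
    registered_indexes.append({"name": name, "ops": set(matched_ops),
                               "cost": fetch_cost})

def choose_index(where_op):
    candidates = [ix for ix in registered_indexes if where_op in ix["ops"]]
    if not candidates:
        return None                      # fall back to a sequential scan
    return min(candidates, key=lambda ix: ix["cost"])

register_index("btree_salary", {"<", "<=", "=", ">=", ">"}, fetch_cost=4.0)
register_index("rtree_location", {"overlaps", "contains"}, fetch_cost=7.5)
register_index("hash_salary", {"="}, fetch_cost=1.2)

best = choose_index("=")                 # cheapest match for equality
```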

    Expensive selection optimization

In relational systems, selection is expected to be a zero-time operation: it requires no I/Os and only a few CPU cycles to test a simple comparison on a column such as emp.salary. When selection conditions invoke expensive user-defined methods, this assumption no longer holds, and optimizing the placement of expensive selections in a query plan becomes an important goal for an ORDBMS.

Comparison of RDBMS, OODBMS and ORDBMS:

Transactions:
RDBMS: Transactions are short and ad hoc in nature.
OODBMS: Transactions are complex and of long duration; applications retrieve objects and work on them for long periods, with related objects (e.g., objects referenced by the original objects) fetched occasionally.
ORDBMS: Transactions are assumed to be short, and the ordinary mechanisms of an RDBMS are used to manage them.

Identification:
RDBMS: Every record is uniquely identified by a primary key.
OODBMS: Every object is uniquely identified by a system-generated Object ID.
ORDBMS: Every object is uniquely identified by a system-generated Object ID.

Suitability:
RDBMS: Suitable for small database management systems such as hotel management, university management, shop management, etc.
OODBMS: Suitable for advanced applications such as Computer Integrated Manufacturing (CIM), advanced office automation systems, hospital patient-care tracking systems, etc. All of these applications are characterized by having to manage complex, highly interrelated information, which is a strength of object-oriented database systems.
ORDBMS: Suitable for applications such as complex data analysis, digital asset management, geographic data, and bio-medical data.

Examples:
RDBMS: Oracle, SQL Server, MySQL, etc.
OODBMS: ObjectStore, Versant, Gemstone, etc.
ORDBMS: Postgres, SQL 92.

Query language:
RDBMS: A standard query language (SQL) is present.
OODBMS: Lacks a standard query language.
ORDBMS: Lacks a standard query language.


    UNIT- 3

    Parallel and Distributed Databases

A parallel database system is one that seeks to improve performance through parallel implementation of various operations such as loading data, building indexes, and evaluating queries.

    Parallel Database Systems

A parallel database system tries to improve performance through parallelization of various operations such as loading data, evaluating queries, etc. The main goal of such a system is to improve performance. In contrast, in distributed database systems the data distribution is the governing factor, and the main goal of such systems is to increase availability and reliability.

    Some terms that defines systems performance:

Throughput: The number of tasks (transactions) that can be completed in a given time interval.

Response Time: The amount of time taken to complete a single task from the time it is submitted.

A system that processes a large number of small transactions can improve throughput by processing many transactions in parallel.

A system that processes large transactions can improve response time as well as throughput by dividing each transaction into a number of sub-transactions that can be executed in parallel.

Speed-Up: Running a given task in less time by increasing the degree of parallelism is called speed-up.

Speed-up = TS/TL, where TS = time required on the small system and TL = time required on the large system with more resources.

A parallel system is said to demonstrate linear speed-up if the speed-up is N when resources are increased N times.

Scale-Up: Handling larger tasks in the same amount of time by increasing the degree of parallelism is called scale-up.

Scale-up = TS/TL, where TS = time required to execute a task of size Q and TL = time required to execute a task of size Q*N on a system with N times the resources.

The parallel system is said to demonstrate linear scale-up on a task of size Q if TS = TL when resources are increased N times.
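The two definitions can be checked with a few lines of Python (the timing numbers are invented for illustration):

```python
# Worked check of the speed-up and scale-up definitions above.

def speed_up(ts, tl):
    """ts: time on the small system; tl: time on the large system."""
    return ts / tl

def scale_up(ts, tl):
    """ts: time for a task of size Q; tl: time for a task of size Q*N
    on a system with N times the resources."""
    return ts / tl

# Linear speed-up: 10 times the resources, one tenth the time.
linear_speedup = speed_up(ts=100.0, tl=10.0)      # 10.0

# Linear scale-up: a 10x larger task on a 10x larger system takes the
# same time, so the ratio is 1.
linear_scaleup = scale_up(ts=60.0, tl=60.0)       # 1.0
```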

    Parallel Database architectures:

Three main architectures are proposed for building parallel databases:

1. Shared-memory (all processors share common memory): multiple CPUs are attached to an interconnection network and can access a common region of main memory. In the shared-memory architecture, the processors and disks have access to common memory via a bus or through an interconnection network.

A processor can send messages to other processors using memory writes, which is a much faster communication mechanism.


Advantage: Shared memory provides extremely efficient communication between processors, and data in shared memory can be accessed by any processor without being moved with software.

Disadvantage: The shared-memory architecture is not scalable beyond 32 or 64 processors, since the bus or interconnection network becomes a bottleneck.

2. Shared-disk (all processors share common disks and have private memories): each CPU has a private memory and direct access to all disks through an interconnection network.

Advantages: Each processor has its own local memory, so the memory bus is not a bottleneck. This architecture provides a higher degree of fault tolerance (if a processor fails, the other processors can take over its task).

Disadvantage: The interconnection to the disk subsystem is now a bottleneck.

3. Shared-nothing (each node of the machine consists of a processor, memory, and one or more disks): each CPU has local main memory and disk space, but no two CPUs can access the same storage area; all communication between CPUs is through a network connection.

Advantages: Instead of passing all I/O through a single interconnection network, only queries to non-local disks and result relations are passed through the network. These architectures are more scalable and can easily support a large number of processors. Transmission capacity increases as more nodes are added.

Disadvantage: The cost of communication and non-local disk access is higher than in the other architectures, because transmitting data involves software interaction at both ends.

    PARALLEL QUERY EVALUATION


We discuss parallel evaluation of a relational query in a DBMS with a shared-nothing architecture, with emphasis on the parallel execution of a single query.

A relational query execution plan is a graph of relational algebra operators, and the operators in the graph can be executed in parallel. If one operator consumes the output of a second operator, we have pipelined parallelism.

Each individual operator can also be executed in parallel by partitioning the input data, working on each partition in parallel, and then combining the results. This approach is called data-partitioned parallel evaluation.

Data Partitioning: Large datasets are partitioned horizontally across several disks; this enables us to exploit the I/O bandwidth of the disks by reading and writing them in parallel. This can be done in the following ways:

a. Round-robin partitioning: If there are n processors, the i-th tuple is assigned to processor i mod n.

b. Hash partitioning: A hash function is applied to (selected fields of) a tuple to determine its processor. Hash partitioning has the additional virtue that it keeps data evenly distributed even if the data grows and shrinks over time.

c. Range partitioning: Tuples are sorted (conceptually), and n ranges are chosen for the sort key values so that each range contains roughly the same number of tuples; tuples in range i are assigned to processor i.

Range partitioning can lead to data skew; that is, partitions with widely varying numbers of tuples across partitions or disks. Skew causes processors dealing with large partitions to become performance bottlenecks.
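The three partitioning schemes above can be sketched in Python; the employee tuples and field choices are invented for illustration:

```python
# Horizontal partitioning of tuples across n processors, three ways.

def round_robin(tuples, n):
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)          # i-th tuple goes to processor i mod n
    return parts

def hash_partition(tuples, n, key):
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)   # hash of selected field
    return parts

def range_partition(tuples, splits, key):
    # splits is an ascending splitting vector; a tuple whose key exceeds
    # i split values goes to processor i.
    parts = [[] for _ in range(len(splits) + 1)]
    for t in tuples:
        i = sum(1 for s in splits if key(t) > s)
        parts[i].append(t)
    return parts

emps = [("anne", 12), ("bob", 27), ("carol", 41), ("dave", 8)]  # (name, salary)
by_salary = lambda t: t[1]
rr = round_robin(emps, 2)
rp = range_partition(emps, splits=[20], key=by_salary)
```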

    PARALLELIZING INDIVIDUAL OPERATIONS

Various operations can be implemented in parallel in a shared-nothing architecture.

Bulk Loading and Scanning: If the relation is partitioned across several disks, pages can be read in parallel while scanning the relation, and the retrieved tuples can then be merged. If a relation has associated indexes, any sorting of data entries required for building the indexes during bulk loading can also be done in parallel.


Sorting: Sorting can be done by redistributing all tuples in the relation using range partitioning.

For example, consider sorting a collection of employee tuples by salary, whose values lie in a known range. With N processors, each processor receives the tuples that lie in the range assigned to it (e.g., processor 1 receives all tuples in the range 10 to 20, and so on). Each processor then sorts its tuples locally, and the sorted runs can be combined simply by visiting the processors in range order and concatenating their output.

The problem with range partitioning is data skew, which limits the scalability of the parallel sort. One good approach to range partitioning is to obtain a sample of the entire relation by taking samples at each processor that initially contains part of the relation. The (relatively small) sample is sorted and used to identify ranges with equal numbers of tuples. This set of range values, called a splitting vector, is then distributed to all processors and used to range-partition the entire relation.
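A toy version of this sample-based range-partitioned sort; the sample size and random seed are arbitrary, and the per-partition sorts stand in for the work each processor would do in parallel:

```python
import random

# Sample-based parallel sort: sample to build a splitting vector,
# range-partition on it, sort each partition, and concatenate in range order.

def splitting_vector(sample, n):
    s = sorted(sample)
    # n-1 split points dividing the sample into n roughly equal ranges
    return [s[len(s) * i // n] for i in range(1, n)]

def parallel_sort(tuples, n, sample_size=64):
    rng = random.Random(0)
    sample = rng.sample(tuples, min(sample_size, len(tuples)))
    splits = splitting_vector(sample, n)
    parts = [[] for _ in range(n)]
    for t in tuples:
        i = sum(1 for s in splits if t > s)     # range partition
        parts[i].append(t)
    # Each partition would be sorted at its own processor; concatenating
    # the sorted partitions in range order yields the full sorted output.
    return [x for part in parts for x in sorted(part)]

data = list(range(100))
random.Random(1).shuffle(data)
out = parallel_sort(data, 4)
```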

Joins: Here we consider how the join operation can be parallelized.

Consider two relations A and B to be joined on the age attribute, where A and B are initially distributed across several disks in a way that is not useful for the join operation.

We decompose the join into a collection of k smaller joins by partitioning both A and B into k logical partitions.

If the same partitioning function is used for both A and B, then the union of the k smaller joins computes the join of A and B.
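A sketch of this partitioned join, using a hash function on the join attribute as the common partitioning function and a simple hash join for each of the k smaller joins (the sample relations are invented):

```python
# Data-partitioned join: both relations are partitioned on the join
# attribute with the same hash function, and each of the k smaller joins
# is computed independently (in parallel on a real system).

def partition(rel, k, key):
    parts = [[] for _ in range(k)]
    for t in rel:
        parts[hash(key(t)) % k].append(t)
    return parts

def local_join(a_part, b_part, a_key, b_key):
    index = {}
    for b in b_part:                       # build a hash table on one input
        index.setdefault(b_key(b), []).append(b)
    return [(a, b) for a in a_part for b in index.get(a_key(a), [])]

def parallel_join(A, B, a_key, b_key, k=4):
    a_parts = partition(A, k, a_key)
    b_parts = partition(B, k, b_key)       # same partitioning function
    result = []
    for i in range(k):                     # each pair joined independently
        result.extend(local_join(a_parts[i], b_parts[i], a_key, b_key))
    return result

A = [("alice", 30), ("bob", 25)]           # (name, age)
B = [(30, "gold"), (25, "silver"), (40, "none")]   # (age, status)
joined = parallel_join(A, B, a_key=lambda t: t[1], b_key=lambda t: t[0])
```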

    DISTRIBUTED DATABASES

The idea of a distributed database is that the data should be physically stored at different locations, but its distribution and access should be transparent to the user.

Introduction to Distributed Databases:

    A Distributed Database should exhibit the following properties:

1) Distributed Data Independence: The user should be able to access the database without needing to know the location of the data.

2) Distributed Transaction Atomicity: The concept of atomicity should be extended to transactions whose operations take place at multiple distributed sites.


Types of Distributed Databases:

a) Homogeneous Distributed Database: the data stored across multiple sites is managed by the same DBMS software at all sites.

b) Heterogeneous Distributed Database: multiple sites, which may be autonomous, are under the control of different DBMS software.

Architecture of DDBs:

There are three architectures:

Client-Server: A client-server system has one or more client processes and one or more server processes, and a client process can send a query to any one server process. Clients are responsible for user-interface issues, and servers manage data and execute transactions.

Thus, a client process could run on a personal computer and send queries to a server running on a mainframe.

Advantages:

1. Simple to implement because of the centralized server and separation of functionality.

2. Expensive server machines are not underutilized by simple user interactions, which are instead pushed onto inexpensive client machines.

3. Users get a familiar and friendly client-side user interface rather than an unfamiliar and unfriendly server interface.

    Collaborating Server:

In the client-server architecture, a single query cannot be split and executed across multiple servers, because the client process would have to be complex and intelligent enough to break a query into subqueries to be executed at different sites and then assemble their results, making the client's capabilities overlap with the server's. This makes it hard to distinguish between clients and servers.

In a collaborating-server system, we have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers.

When a server receives a query that requires access to data at other servers, it generates appropriate subqueries to be executed by the other servers and puts the results together to compute the answer to the original query.

Middleware: A middleware system is a special server, a layer of software that coordinates the execution of queries and transactions across one or more independent database servers.


The middleware architecture is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multi-site execution strategies. It is especially attractive when trying to integrate several legacy systems, whose basic capabilities cannot be extended.

We need just one database server that is capable of managing queries and transactions spanning multiple servers; the remaining servers only need to handle local queries and transactions.

    STORING DATA IN DDBS

Data storage involves two concepts:

1. Fragmentation

2. Replication

    Fragmentation:

It is the process in which a relation is broken into smaller relations, called fragments, which are possibly stored at different sites.

It is of two types:

1. Horizontal fragmentation: the original relation is broken into a number of fragments, where each fragment is a subset of rows. The union of the horizontal fragments should reproduce the original relation.

2. Vertical fragmentation: the original relation is broken into a number of fragments, where each fragment consists of a subset of columns. The system often assigns a unique tuple id to each tuple in the original relation so that joining the fragments gives a lossless reconstruction. The collection of all vertical fragments should reproduce the original relation.
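Both kinds of fragmentation, and the lossless reconstruction, can be illustrated with in-memory relations (the employee data is invented):

```python
# Horizontal and vertical fragmentation with lossless reconstruction.

employees = [
    {"tid": 1, "name": "anne", "dept": "sales"},
    {"tid": 2, "name": "bob", "dept": "hr"},
    {"tid": 3, "name": "carol", "dept": "sales"},
]

# Horizontal fragmentation: each fragment is a subset of rows.
frag_sales = [r for r in employees if r["dept"] == "sales"]
frag_other = [r for r in employees if r["dept"] != "sales"]
# The union of the horizontal fragments reproduces the original relation.
horiz_ok = sorted(frag_sales + frag_other, key=lambda r: r["tid"]) == employees

# Vertical fragmentation: each fragment keeps a subset of columns plus the
# tuple id (tid), so joining on tid is lossless.
frag_names = [{"tid": r["tid"], "name": r["name"]} for r in employees]
frag_depts = [{"tid": r["tid"], "dept": r["dept"]} for r in employees]
rebuilt = [{**n, **d} for n in frag_names for d in frag_depts
           if n["tid"] == d["tid"]]
vert_ok = sorted(rebuilt, key=lambda r: r["tid"]) == employees
```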

Replication: Replication occurs when we store more than one copy of a relation, or of its fragments, at multiple sites.

Advantages:

1. Increased availability of data: If a site that contains a replica goes down, we can find the same data at other sites. Similarly, if local copies of remote relations are available, we are less vulnerable to failure of communication links.

2. Faster query evaluation: Queries can execute faster by using a local copy of a relation instead of going to a remote site.


    Distributed catalog management :

    Naming Object Its related to the unique identification of each fragment that has beeneither partitionedor replicated. This can be done by using a global name server that can assign globally unique names.

    This can be implemented by using the following two fields:-

    1. Local name fieldlocally assigned name by the site where the relation is created. Twoobjects at different sites can have same local names.

    2. Birth site fieldindicates the site at which the relation is created and where informationabout its fragments and replicas is maintained.

Catalog Structure: A centralized system catalog can be used to maintain information about all the relations in the distributed database, but it is vulnerable to failure of the site containing the catalog.

This could be avoided by maintaining a copy of the global system catalog at every site, but that requires broadcasting every change made to a local catalog to all its replicas.

Another alternative is to maintain, at every site, a local catalog that keeps track of all the replicas of each relation born at that site.

    Distributed Data Independence:

It means that users should be able to query the database without needing to specify the location of the fragments or replicas of a relation; locating them is the job of the DBMS.

Users can be enabled to access relations without considering how the relations are distributed as follows: the local name of a relation in the system catalog is a combination of a user name and a user-defined relation name. When a query is issued, the DBMS adds the user name to the relation name to get a local name, then adds the user's site-id as the (default) birth site to obtain a global relation name. By looking up the global relation name in the local catalog (if it is cached there) or in the catalog at the birth site, the DBMS can locate replicas of the relation.
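A toy rendering of this name-resolution scheme; the catalog contents and the naming format are invented for illustration:

```python
# Resolve a user's relation name to a global name, then look up replicas.

def global_name(user, relation, user_site):
    local = f"{user}.{relation}"        # user name + user-defined relation name
    return f"{local}@{user_site}"       # default birth site = user's site-id

# Catalog mapping global relation names to the sites holding replicas.
catalog = {"joe.Sailors@London": ["London", "Paris"]}

name = global_name("joe", "Sailors", "London")
replicas = catalog.get(name, [])
```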

    Distributed query processing:

In a distributed system, several factors complicate query processing.

One factor is the cost of transferring data over the network. This data includes intermediate files that are transferred to other sites for further processing, as well as final result files that may have to be transferred to the site where the query result is needed. Although these costs may not be very high if the sites are connected via a fast local network, they can become quite significant in other types of networks.

Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.

Consider an EMPLOYEE relation and a DEPARTMENT relation with the following statistics:

EMPLOYEE: 10,000 records, each 100 bytes long. The Fname field is 15 bytes, the Lname field is 15 bytes, the SSN field is 9 bytes, and the Dnum field is 4 bytes.
Size of the EMPLOYEE relation: 100 * 10,000 = 10^6 bytes.

DEPARTMENT: 100 records, each 35 bytes long. The Dname field is 10 bytes, the Dnumber field is 4 bytes, and the MGRSSN field is 9 bytes.
Size of the DEPARTMENT relation: 35 * 100 = 3,500 bytes.

Now consider the following query: for each employee, retrieve the employee name and the name of the department for which the employee works.

Using relational algebra this query can be expressed as:

π FNAME, LNAME, DNAME (EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT)

If we assume that every employee is related to a department, then the result of this query will include 10,000 records.


Now suppose that each record in the query result is 40 bytes long and that the query is submitted at a distinct site, site 3, which is the result site.

There are three strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and DEPARTMENT relations to site 3 (the result site) and perform the join there. In this case a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.

2. Transfer the EMPLOYEE relation to site 2 (the site holding the DEPARTMENT relation), perform the join there, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.

3. Transfer the DEPARTMENT relation to site 1 (the site holding the EMPLOYEE relation), perform the join there, and send the result to site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
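The arithmetic for the three strategies can be verified directly:

```python
# Transfer costs of the three strategies, from the figures above.

EMP_BYTES = 100 * 10_000        # EMPLOYEE relation: 1,000,000 bytes
DEPT_BYTES = 35 * 100           # DEPARTMENT relation: 3,500 bytes
RESULT_BYTES = 40 * 10_000      # query result: 400,000 bytes

strategy1 = EMP_BYTES + DEPT_BYTES       # ship both relations to site 3
strategy2 = EMP_BYTES + RESULT_BYTES     # ship EMPLOYEE to site 2, result to 3
strategy3 = DEPT_BYTES + RESULT_BYTES    # ship DEPARTMENT to site 1, result to 3

best = min(strategy1, strategy2, strategy3)   # strategy 3 wins
```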

Nonjoin Queries in a Distributed DBMS: Consider the following two relations:

Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: date, rname: string)

Now consider the following query:

SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7

Suppose that the Sailors relation is horizontally fragmented, with all tuples having a rating less than 5 stored at Shanghai and all tuples having a rating of 5 or higher stored at Tokyo.

The DBMS will answer this query by evaluating it at both sites and then taking the union of the answers.

    Joins in a Distributed DBMS:

Joins of relations stored at different sites can be very expensive, so we now consider the evaluation options that must be considered in a distributed environment. Suppose that the Sailors relation is stored at London and the Reserves relation is stored at Paris; we consider strategies for computing the join of Sailors and Reserves.

In the examples that follow, the time taken to read one page from disk (or to write one page to disk) is denoted td, and the time taken to ship one page from any site to another is denoted ts.

DISTRIBUTED CONCURRENCY CONTROL AND RECOVERY

The main issues with respect to distributed transactions are:

Distributed Concurrency Control

How can deadlocks be detected in a distributed database?

How can locks for objects stored across several sites be managed?


    Distributed Recovery

    When a transaction commits, all its actions across all the sites at which it executes mustpersist.

When a transaction aborts, none of its actions must be allowed to persist.

Concurrency Control and Recovery in Distributed Databases: For concurrency control and recovery purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a centralized DBMS environment. These include the following:

Dealing with multiple copies of data items: The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and later recovers.

Failure of individual sites: The DBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.

Failure of communication links: The system must be able to deal with failure of one or more of the communication links that connect the sites. An extreme case of this problem is network partitioning, which breaks up the sites into two or more partitions where the sites within each partition can communicate only with one another and not with sites in other partitions.

Distributed commit: Problems can arise with committing a transaction that accesses databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol is often used to deal with this problem.

Distributed deadlock: Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.

    Lock management can be distributed across sites in many ways:

Centralized: A single site is in charge of handling lock and unlock requests for all objects.

Primary copy: One copy of each object is designated as the primary copy. All requests to lock or unlock a copy of such an object are handled by the lock manager at the site where the primary copy is stored, regardless of where the copy itself is stored.

Fully distributed: Requests to lock or unlock a copy of an object stored at a site are handled by the lock manager at the site where the copy is stored.

Distributed Deadlock: One issue that requires special attention when using either primary copy or fully distributed locking is deadlock detection.

Each site maintains a local waits-for graph, and a cycle in a local graph indicates a deadlock. For example, suppose that we have two sites A and B, both containing copies of objects O1 and O2, and that the read-any write-all technique is used.


T1, which wants to read O1 and write O2, obtains an S lock on O1 and an X lock on O2 at site A, and requests an X lock on O2 at site B.

T2, which wants to read O2 and write O1, meanwhile obtains an S lock on O2 and an X lock on O1 at site B, and then requests an X lock on O1 at site A.

As shown in the following figure, T2 is waiting for T1 at site A and T1 is waiting for T2 at site B; neither local waits-for graph has a cycle, yet together they form a deadlock.

To detect such deadlocks, a distributed deadlock detection algorithm must be used. There are three types of algorithms:

    1. Centralized Algorithm:

It consists of periodically sending all local waits-for graphs to one site that is responsible for global deadlock detection.

At this site, the global waits-for graph is generated by combining all the local graphs: the set of nodes is the union of the nodes in the local graphs, and there is an edge from one node to another if such an edge appears in any of the local graphs.
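The centralized algorithm on the two-site example above can be sketched as follows; a waits-for graph is represented simply as a set of (waiter, holder) edges:

```python
# Centralized deadlock detection: union the local waits-for graphs and
# look for a cycle in the combined graph.

def global_graph(local_graphs):
    edges = set()
    for g in local_graphs:
        edges |= set(g)                 # union of edges from every site
    return edges

def has_cycle(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    def reachable(start, target, seen):
        for w in adj.get(start, []):
            if w == target or (w not in seen
                               and reachable(w, target, seen | {w})):
                return True
        return False
    # A transaction that can reach itself is on a cycle, i.e. deadlocked.
    return any(reachable(u, u, {u}) for u in adj)

# Site A: T2 waits for T1; site B: T1 waits for T2 (the example above).
site_a = [("T2", "T1")]
site_b = [("T1", "T2")]
deadlock = has_cycle(global_graph([site_a, site_b]))
```

Neither local graph contains a cycle, but the combined graph does, which is exactly why the local graphs must be shipped to one site.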

    2. Hierarchical Algorithm:

This algorithm groups sites into hierarchies; for example, sites might be grouped by state, then by country, and finally into a single group that contains all sites.

Every node in this hierarchy constructs a waits-for graph that reveals deadlocks involving only sites contained in (the subtree rooted at) that node.

Thus, all sites periodically (e.g., every 10 seconds) send their local waits-for graphs to the site constructing the waits-for graph for their country, and the sites constructing country-level waits-for graphs periodically (e.g., every 10 minutes) send their graphs to the site constructing the global waits-for graph.


    3. Simple Algorithm:

If a transaction waits longer than some chosen time-out interval, it is aborted.

Although this algorithm causes many unnecessary restarts, the overhead of deadlock detection is low.

Distributed Recovery: Recovery in a distributed DBMS is more complicated than in a centralized DBMS for the following reasons:

New kinds of failure can arise: failure of communication links, and failure of a remote site at which a sub-transaction is executing.

Either all sub-transactions of a given transaction must commit, or none must commit, and this property must be guaranteed despite any combination of site and link failures. This guarantee is achieved using a commit protocol.

Normal Execution and Commit Protocols: During normal execution, each site maintains a log, and the actions of a sub-transaction are logged at the site where it executes.

In addition to this regular logging activity, a commit protocol is followed to ensure that all sub-transactions of a given transaction either commit or abort uniformly.

The transaction manager at the site where the transaction originated is called the Coordinator for the transaction, and the transaction managers at the sites where its sub-transactions execute are called Subordinates.

Two-Phase Commit Protocol: When the user decides to commit a transaction, the commit command is sent to the coordinator for the transaction.

This initiates the 2PC protocol:

The coordinator sends a Prepare message to each subordinate.

When a subordinate receives a Prepare message, it decides whether to abort or commit its sub-transaction. It force-writes an abort or prepare log record and then sends a No or Yes message to the coordinator.

Here we can have two conditions:

o If the coordinator receives Yes messages from all subordinates, it force-writes a commit log record and then sends a Commit message to all the subordinates.

o If it receives even one No message, or no response from some subordinate within a specified time-out period, it force-writes an abort log record and then sends an Abort message to all subordinates.

Here again we can have two conditions:

o When a subordinate receives an Abort message, it force-writes an abort log record, sends an Ack message to the coordinator, and aborts the sub-transaction.

o When a subordinate receives a Commit message, it force-writes a commit log record, sends an Ack message to the coordinator, and commits the sub-transaction.
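The coordinator's decision rule can be summarized in a toy simulation; this is only the voting logic, with no real logging, timeouts, or message transport:

```python
# Toy simulation of the two-phase commit decision logic.

def two_phase_commit(subordinate_votes):
    log = []                              # coordinator's force-written records
    # Phase 1: coordinator sends Prepare; each subordinate votes yes or no
    # (a missed time-out would count as a no vote).
    if all(vote == "yes" for vote in subordinate_votes):
        log.append("commit")              # force-write commit record
        outcome = "commit"                # send Commit to all subordinates
    else:
        log.append("abort")               # force-write abort record
        outcome = "abort"                 # send Abort to all subordinates
    # Phase 2: each subordinate force-writes the matching record and acks.
    acks = ["ack"] * len(subordinate_votes)
    return outcome, log, acks

outcome, _, _ = two_phase_commit(["yes", "yes", "yes"])   # all prepared
aborted, _, _ = two_phase_commit(["yes", "no", "yes"])    # one No vote
```

A single No vote (or silence past the time-out) is enough to abort the whole transaction, which is what makes the protocol's all-or-nothing guarantee work.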

    UNIT IV

INTRODUCTION TO DATABASE SECURITY

There are three main objectives to consider while designing a secure database application:

    1. Secrecy: Information should not be disclosed to unauthorized users. For example, a student shouldnot be allowed to examine other students' grades.

    2. Integrity: Only authorized users should be allowed to modify data. For example, students may beallowed to see their grades, yet not allowed (obviously!) to modify them.

3. Availability: Authorized users should not be denied access. For example, an instructor who wishes to change a grade should be allowed to do so.

A DBMS typically includes a database security and authorization subsystem that is responsible for ensuring the security of portions of a database against unauthorized access. It is customary to refer to two types of database security mechanisms:

Discretionary security mechanisms: These are used to grant privileges to users, including the capability to access specific data files, records, or fields in a specified mode (such as read, insert, delete, or update).

Mandatory security mechanisms: These are used to enforce multilevel security by classifying the data and users into various security classes (or levels) and then implementing the appropriate security policy of the organization. For example, a typical policy is to permit users at a certain classification level to see only data items classified at the user's own level. An extension of this is role-based security, which enforces policies and privileges based on the concept of roles.

    ACCESS CONTROL

    A DBMS should provide mechanisms to control access to data. A DBMS offers two main approaches toaccess control.

    Discretionary access control

    Mandatory access control


Discretionary access control: It is based on the concept of access rights, or privileges, and mechanisms for giving users such privileges. A privilege allows a user to access some data object in a certain manner (e.g., to read or to modify). A user who creates a database object such as a table or a view automatically gets all applicable privileges on that object. SQL-92 supports discretionary access control through the GRANT and REVOKE commands.

    The GRANT command gives privileges to users.

    The GRANT command gives privileges on base tables and views. The syntax of this command is as follows:

    GRANT privileges ON object TO users [WITH GRANT OPTION]

    Here object is either a base table or a view.

    Several privileges can be specified including:

    SELECT: The right to access (read) all columns of the table specified as object, including columns added later through ALTER TABLE commands.

    INSERT(column-name): The right to insert rows with (non-null or non-default) values in the named column of the table named as object. The privileges UPDATE(column-name) and UPDATE are similar to INSERT.

    DELETE: The right to delete rows from the table named as object.

    REFERENCES(column-name): The right to define foreign keys (in other tables) that refer to the specified column of the table object. REFERENCES without a column name specified denotes this right with respect to all columns.

    For Example:

    Suppose that user Joe has created the tables BOATS, RESERVES, and SAILORS. Some examples of GRANT commands that Joe can now execute are:

    GRANT INSERT, DELETE ON RESERVES TO Yuppy WITH GRANT OPTION

    GRANT SELECT ON RESERVES TO Michael

    GRANT SELECT ON SAILORS TO Michael WITH GRANT OPTION

    GRANT UPDATE (rating) ON SAILORS TO Leah

    GRANT REFERENCES (bid) ON BOATS TO Bill

    Adding WITH GRANT OPTION at the end of a GRANT command allows the user who has been granted the privilege to pass that privilege on to other users.

    In the above examples, Yuppy can insert or delete Reserves rows and can authorize someone else to do the same. Michael can execute SELECT queries on Sailors and Reserves, and he can pass this privilege on to others for Sailors, but not for Reserves.

    The REVOKE command takes away privileges.

    This is a complementary command to GRANT that allows the withdrawal of privileges.

    The syntax of the REVOKE command is as follows:

    REVOKE [GRANT OPTION FOR] privileges ON object FROM users {RESTRICT | CASCADE}

    The command can be used to revoke either a privilege or just the grant option on a privilege (by using the GRANT OPTION FOR clause).

    A user who has granted a privilege to another user may change his mind and want to withdraw the granted privilege. The intuition behind exactly what effect a REVOKE command has is complicated by the fact that a user may be granted the same privilege multiple times, possibly by different users.


    When a user executes a REVOKE command with the CASCADE keyword, the effect is to withdraw the named privileges or grant option from all users who currently hold these privileges solely through a GRANT command that was previously executed by the user who is now executing the REVOKE command. If these users received the privileges with the grant option and passed them along, those recipients also lose their privileges as a consequence of the REVOKE command, unless they received these privileges independently.

    For Example:

    GRANT SELECT ON Sailors TO Art WITH GRANT OPTION (executed by Joe)
    GRANT SELECT ON Sailors TO Bob WITH GRANT OPTION (executed by Art)

    REVOKE SELECT ON Sailors FROM Art CASCADE (executed by Joe)

    Art loses the SELECT privilege on Sailors, of course. Then Bob, who received this privilege from Art, and only Art, also loses this privilege.

    If the RESTRICT keyword is specified in the REVOKE command, the command is rejected if revoking the privileges just from the users specified in the command would result in other privileges becoming abandoned.
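    The cascading behaviour above can be sketched as a walk over the chain of grants. This is a toy model that tracks a single privilege and remembers only one grantor per user; a real DBMS records a full authorization graph, so treat the names and the single-source simplification as assumptions for illustration.

```python
# Toy model of REVOKE ... CASCADE: grants maps each user to the user who
# granted them SELECT on Sailors (Joe granted Art, Art granted Bob).
grants = {"Art": "Joe", "Bob": "Art"}

def revoke_cascade(grants, revoker, target):
    """Remove target's privilege and, transitively, every grant derived from it."""
    if grants.get(target) != revoker:
        return                                  # target did not get it from revoker
    del grants[target]
    # Any user whose only source was `target` loses the privilege too
    for user in [u for u, src in grants.items() if src == target]:
        revoke_cascade(grants, target, user)

revoke_cascade(grants, "Joe", "Art")
print(grants)                                   # {} : both Art and Bob lost it
```

Running REVOKE with RESTRICT instead would correspond to refusing the operation as soon as a dependent grant (Bob's) is found.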

    Mandatory access control: It is based on system-wide policies that cannot be changed by individual users. In this approach each database object is assigned a security class, each user is assigned a clearance for a security class, and rules are imposed on the reading and writing of database objects by users. The DBMS determines whether a given user can read or write a given object based on certain rules that involve the security level of the object and the clearance of the user.

    The popular model for mandatory access control, called the Bell-LaPadula model, is described in terms of objects (e.g., tables, views, rows, columns), subjects (e.g., users, programs), security classes, and clearances. Each database object is assigned a security class, and each subject is assigned a clearance for a security class; we will denote the class of an object or subject A as class(A). The security classes in a system are organized according to a partial order, with a most secure class and a least secure class. For simplicity, we will assume that there are four classes: top secret (TS), secret (S), confidential (C), and unclassified (U). In this system, TS > S > C > U, where A > B means that class A data is more sensitive than class B data.

    The Bell-LaPadula model imposes two restrictions on all reads and writes of database objects:

    1. Simple Security Property: Subject S is allowed to read object O only if class(S) >= class(O). For example, a user with TS clearance can read a table with C clearance, but a user with C clearance is not allowed to read a table with TS classification.

    2. *-Property: Subject S is allowed to write object O only if class(S) <= class(O). For example, a user with S clearance can only write objects with S or TS classification.
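    The two rules can be expressed in a few lines of code. This is a minimal sketch: encoding the four classes as integers is an implementation choice, not part of the model.

```python
# Bell-LaPadula checks over the four-level ordering U < C < S < TS
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject, obj):
    # Simple Security Property: read only if class(S) >= class(O)
    return LEVELS[subject] >= LEVELS[obj]

def can_write(subject, obj):
    # *-Property: write only if class(S) <= class(O)
    return LEVELS[subject] <= LEVELS[obj]

print(can_read("TS", "C"))   # True : TS clearance may read C data
print(can_read("C", "TS"))   # False: no "read up"
print(can_write("S", "TS"))  # True : S subject may write S or TS objects
print(can_write("S", "C"))   # False: no "write down"
```

Together the two checks mean information can only flow upward in the classification lattice, never downward.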

    Multilevel Relations and Polyinstantiation

    To apply mandatory access control policies in a relational DBMS, a security class must be assigned to each database object. The objects can be at the granularity of tables, rows, or even individual column values. Let us assume that each row is assigned a security class. This situation leads to the concept of a multilevel table, which is a table with the surprising property that users with different security clearances will see a different collection of rows when they access the same table.

    Consider the instance of the Boats table shown in Figure below. Users with S and TS clearance will get both rows in the answer when they ask to see all rows in Boats. A user with C clearance will get only the second row, and a user with U clearance will get no rows.


    bid    bname    color    Security class
    101    Salsa    Red      S
    102    Pinto    Brown    C

    The Boats table is defined to have bid as the primary key. Suppose that a user with clearance C wishes to enter the row (101, Picante, Scarlet, C). We have a dilemma:

    If the insertion is permitted, two distinct rows in the table will have key 101.

    If the insertion is not permitted because the primary key constraint is violated, the user trying to insert the new row, who has clearance C, can infer that there is a boat with bid=101 whose security class is higher than C. This situation compromises the principle that users should not be able to infer any information about objects that have a higher security classification.

    This dilemma is resolved by effectively treating the security classification as part of the key. Thus, the insertion is allowed to continue, and the table instance is modified as shown in Figure below.

    bid    bname      color      Security class
    101    Salsa      Red        S
    101    Picante    Scarlet    C
    102    Pinto      Brown      C

    Users with clearance C or U see just the rows for Picante and Pinto, but users with clearance S or TS see all three rows. The two rows with bid=101 can be interpreted in one of two ways: only the row with the higher classification (Salsa, with classification S) actually exists, or both exist and their presence is revealed to users according to their clearance level. The choice of interpretation is up to application developers and users.

    Covert Channels, DoD Security Levels

    Even if a DBMS enforces the mandatory access control scheme discussed above, information can flow from a higher classification level to a lower classification level through indirect means, called covert channels. For example, if a transaction accesses data at more than one site in a distributed DBMS, the actions at the two sites must be coordinated. The process at one site may have a lower clearance (say C) than the process at another site (say S), and both processes have to agree to commit before the transaction can be committed. This requirement can be exploited to pass information with an S classification to the process with a C clearance: the transaction is repeatedly invoked, and the process with the C clearance always agrees to commit, whereas the process with the S clearance agrees to commit if it wants to transmit a 1 bit and does not agree if it wants to transmit a 0 bit.

    In this manner, information with an S classification can be sent to a process with a C clearance as a stream of bits. This covert channel is an indirect violation of the intent behind the *-Property.

    Role of the Database Administrator

    The database administrator (DBA) plays an important role in enforcing the security-related aspects of a database design. In conjunction with the owners of the data, the DBA will probably also contribute to developing a security policy. The DBA has a special account, which we will call the system account, and is responsible for the overall security of the system. In particular, the DBA deals with the following:

    1. Creating new accounts: Each new user or group of users must be assigned an authorization id and a password. Note that application programs that access the database have the same authorization id as the user executing the program.

    2. Mandatory control issues: If the DBMS supports mandatory access control (some customized systems for applications with very high security requirements, for example military data, provide such support), the DBA must assign security classes to each database object and assign security clearances to each authorization id in accordance with the chosen security policy.

    3. Audit trail: The DBA is also responsible for maintaining the audit trail, which is essentially the log of updates with the authorization id (of the user who is executing the transaction) added to each log entry. This log is just a minor extension of the log mechanism used to recover from crashes. Additionally, the DBA may choose to maintain a log of all actions, including reads, performed by a user. Analyzing such histories of how the DBMS was accessed can help prevent security violations by identifying suspicious patterns before an intruder finally succeeds in breaking in, or it can help track down an intruder after a violation has been detected.

    Encryption

    A DBMS can use encryption to protect information in certain situations where the normal security mechanisms of the DBMS are not adequate. For example, an intruder may steal tapes containing some data or tap a communication line. By storing and transmitting data in an encrypted form, the DBMS ensures that such stolen data is not intelligible to the intruder.

    Encryption is basically done through an encryption algorithm, which takes the original data and an encryption key as input; the output of the algorithm is the encrypted version of the data. There is also a decryption algorithm, which takes the encrypted data and the encryption key as input and then returns the original data. This approach is called the Data Encryption Standard (DES). The main weakness of this approach is that authorized users must be told the encryption key, and the mechanism for communicating this information is vulnerable to clever intruders.

    Another approach is called public-key encryption. The encryption scheme proposed by Rivest, Shamir, and Adleman, called RSA, is a well-known example of public-key encryption. In this approach each authorized user has a public encryption key, known to everyone, and a private decryption key, chosen by the user and known only to him or her.

    For example: Consider a user called Sam. Anyone can send Sam a secret message by encrypting the message using Sam's publicly known encryption key. Only Sam can decrypt this secret message because the decryption algorithm requires Sam's decryption key, known only to Sam. Since users choose their own decryption keys, the weakness of DES is avoided.
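    The public/private split can be illustrated with a toy RSA computation. The primes here are textbook-sized for readability; real deployments use very large primes and padding schemes, so this is only a sketch of the arithmetic.

```python
# Toy RSA: key generation from two small primes
p, q = 61, 53
n = p * q                      # 3233, part of both keys
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public encryption exponent, coprime with phi
d = pow(e, -1, phi)            # private decryption exponent (2753), Python 3.8+

def encrypt(m):                # anyone can run this with Sam's public key (e, n)
    return pow(m, e, n)

def decrypt(c):                # only Sam, who knows d, can run this
    return pow(c, d, n)

secret = 65
cipher = encrypt(secret)
print(cipher)                  # 2790
print(decrypt(cipher))         # 65 : only the holder of d recovers the message
```

Because only d needs to stay secret, no key ever has to be communicated to message senders, which is exactly how the DES key-distribution weakness is avoided.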


    UNIT V

    What is Postgres?

    Traditional relational database management systems (DBMSs) support a data model consisting of a collection of named relations, containing attributes of a specific type. In current commercial systems, possible types include floating point numbers, integers, character strings, money, and dates. It is commonly recognized that this model is inadequate for future data processing applications. The relational model successfully replaced previous models in part because of its "Spartan simplicity". However, as mentioned, this simplicity often makes the implementation of certain applications very difficult. Postgres offers substantial additional power by incorporating the following four additional basic concepts in such a way that users can easily extend the system:

    classes
    inheritance
    types
    functions

    Other features provide additional power and flexibility:

    constraints
    triggers
    rules
    transaction integrity

    These features put Postgres into the category of databases referred to as object-relational. Postgres is a client/server application. As a user, you only need access to the client portions of the installation.

    POSTGRES ARCHITECTURE

    Postgres uses a simple "process per-user" client/server model. A Postgres session consists of the following cooperating UNIX processes (programs):

    a supervisory daemon process (postmaster),
    the user's frontend application (e.g., the psql program), and
    one or more backend database servers (the postgres process itself).

    A single postmaster manages a given collection of databases on a single host. Such a collection of databases is called an installation or site. Frontend applications that wish to access a given database within an installation make calls to the library. The library sends user requests over the network to the postmaster (see "How a connection is established"), which in turn starts a new backend server process and connects the frontend process to the new server. From that point on, the frontend process and the backend server communicate without intervention by the postmaster. Hence, the postmaster is always running, waiting for requests, whereas frontend and backend processes come and go.


    PostgreSQL actually treats every SQL statement as being executed within a transaction. If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes called a transaction block.

    XML stands for the eXtensible Markup Language. It is a new markup language, developed by the W3C (World Wide Web Consortium).

    Some of the areas where XML will be useful in the near-term include:

    large Web site maintenance, where XML works behind the scenes to simplify the creation of HTML documents
    exchange of information between organizations
    off-loading and reloading of databases
    syndicated content, where content is being made available to different Web sites
    electronic commerce applications where different organizations collaborate to serve a customer
    scientific applications with new markup languages for mathematical and chemical formulas
    electronic books with new markup languages to express rights and ownership
    handheld devices and smart phones with new markup languages optimized for these alternative devices

    XML makes essentially two changes to HTML: it predefines no tags, and it is stricter.

    No Predefined Tags

    Because there are no predefined tags in XML, you, the author, can create the tags that you need. For example, you might invent your own elements:

    <price>499.00</price>
    <link>Pineapplesoft Link</link>

    Stricter

    HTML has a very forgiving syntax. This is great for authors who can be as lazy as they want, but it also makes Web browsers more complex. According to some estimates, more than 50% of the code in a browser handles errors or sloppiness on the author's part.

    XML Example:

    A List of Products in XML


    <?xml version="1.0"?>
    <products>
      <product><name>XML Editor</name><price>499.00</price></product>
      <product><name>DTD Editor</name><price>199.00</price></product>
      <product><name>XML Book</name><price>19.99</price></product>
      <product><name>XML Training</name><price>699.00</price></product>
    </products>

    In this context, XML is used to exchange information between organizations. The XML Web is a large database on which applications can tap.

    Applications exchanging data over the Web

    XML Syntax

    The syntax rules were described in the previous chapters:

    XML documents must have a root element
    XML elements must have a closing tag
    XML tags are case sensitive
    XML elements must be properly nested
    XML attribute values must be quoted
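    These rules are what make XML parsers strict: a document that violates any of them is rejected outright rather than silently repaired. A quick sketch using Python's standard-library parser (the note document is a made-up example):

```python
import xml.etree.ElementTree as ET

well_formed = "<note><to>Tove</to></note>"
ill_formed  = "<note><to>Tove</note>"        # <to> is never closed

root = ET.fromstring(well_formed)            # parses fine
print(root.tag)                              # note

try:
    ET.fromstring(ill_formed)
except ET.ParseError as e:
    print("rejected:", e)                    # parser refuses the sloppy document
```

This is the opposite of the forgiving HTML behaviour described above: the burden of correctness moves from the browser to the author.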


    XML Schemas

    The DTD is the original modeling language or schema for XML.

    The syntax for DTDs is different from the syntax for XML documents.

    The purpose of a DTD is to define the structure of an XML document. It defines the structure with a list of legal elements:

    Example:

    <?xml version="1.0"?>
    <!DOCTYPE note [
    <!ELEMENT note (to, from, heading, body)>
    <!ELEMENT to (#PCDATA)>
    <!ELEMENT from (#PCDATA)>
    <!ELEMENT heading (#PCDATA)>
    <!ELEMENT body (#PCDATA)>
    ]>
    <note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    </note>

    XML Schema

    W3C supports an XML-based alternative to the DTD, called XML Schema. Unlike DTDs, XML Schemas are themselves written in XML syntax.


    XML NAMESPACES

    XML Namespaces provide a method to avoid element name conflicts.

    An XML namespace is a collection of element and attribute names. XML namespaces provide a means for document authors to unambiguously refer to elements with the same name (i.e., prevent collisions). For example,

    <subject>Geometry</subject>

    and

    <subject>Cardiology</subject>

    both use element subject to mark up data. In the first case, the subject is something one studies in school, whereas in the second case, the subject is a field of medicine. Namespaces can differentiate these two subject elements, for example:

    <school:subject>Geometry</school:subject>

    and

    <medical:subject>Cardiology</medical:subject>
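    A namespace-aware parser keeps the two subject elements distinct by expanding each prefix to the URI it is bound to. A small sketch (the prefixes and URIs here are made-up placeholders):

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns:school="urn:example:school"
              xmlns:medical="urn:example:medical">
  <school:subject>Geometry</school:subject>
  <medical:subject>Cardiology</medical:subject>
</root>"""

root = ET.fromstring(doc)
# ElementTree rewrites each tag as {namespace-uri}localname,
# so the two subject elements no longer collide.
for child in root:
    print(child.tag, child.text)
# {urn:example:school}subject Geometry
# {urn:example:medical}subject Cardiology
```

Because the identity of an element is the (URI, local name) pair, two documents can use the name subject freely without any risk of collision.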

    Benefits of the DTD

    The main benefits of using a DTD are:

    The XML processor enforces the structure, as defined in the DTD.
    The application accesses the document structure, such as to populate an element list.
    The DTD gives hints to the XML processor; that is, it helps separate indenting from content.
    The DTD can declare default or fixed values for attributes. This might result in a smaller document.

    XSL

    XSL stands for EXtensible Stylesheet Language.

    The World Wide Web Consortium (W3C) started to develop XSL because there was a need for an XML-based stylesheet language.


    XSL = Style Sheets for XML

    XML does not use predefined tags (we can use any tag names we like), and therefore the meaning of each tag is not well understood.

    A tag could mean an HTML table, a piece of furniture, or something else, and a browser does not know how to display it.

    XSL describes how the XML document should be displayed!

    XSL consists of three parts:

    XSLT - a language for transforming XML documents
    XPath - a language for navigating in XML documents
    XSL-FO - a language for formatting XML documents

    What is XSLT?

    XSLT is a language for transforming XML documents into XHTML documents or into other XML documents.

    XSLT stands for XSL Transformations

    XSLT is the most important part of XSL

    XSLT transforms an XML document into another XML document

    XSLT uses XPath to navigate in XML documents

    XSLT is a W3C Recommendation

    XPath is a language for navigating in XML documents. XSLT uses XPath to find information in an XML document. XPath is used to navigate through elements and attributes in XML documents.
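    A flavour of XPath-style navigation can be shown with Python's standard library, which implements a subset of XPath in findall; the product catalog below is a made-up example in the spirit of the earlier product list.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<products>"
    "<product><name>XML Editor</name><price>499.00</price></product>"
    "<product><name>DTD Editor</name><price>199.00</price></product>"
    "</products>")

# The path ./product/name selects every name element under a product element
names = [n.text for n in catalog.findall("./product/name")]
print(names)                        # ['XML Editor', 'DTD Editor']

# Paths can also dig out sibling data, e.g. each product's price
prices = [p.text for p in catalog.findall("./product/price")]
print(prices)                       # ['499.00', '199.00']
```

An XSLT stylesheet uses exactly this kind of path expression to pick out the nodes it wants to transform.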

    What is XSL-FO?

    XSL-FO is a language for formatting XML data

    XSL-FO stands for Extensible Stylesheet Language Formatting Objects

    XSL-FO is based on XML

    XSL-FO is a W3C Recommendation

    XSL-FO is now formally named XSL.