
    Draft: Keynote Talk at Int. Conf. On Data Engineering 2003.

    Out-of-the-Box Data Engineering: Events in Heterogeneous Data Environments

    Ramesh Jain

    Electrical and Computer Engineering, and College of Computing

    Georgia Institute of Technology,

    Atlanta, GA 30332-0250

    [email protected]

    Abstract

    Data has changed significantly over the last few decades. Computing systems that initially dealt with data and computation rapidly moved to information and communication. The next step on the evolutionary scale is insight and experience. Applications are demanding the use of live, spatio-temporal, heterogeneous data. Data engineering must keep pace by designing experiential environments that let users apply their senses to observe data and information about an event and to interact with the aspects of the event that are of particular interest. I call this out-of-the-box data engineering because it means we must think beyond many of our timeworn perspectives and technical approaches.

    1.0 Introduction

    Data engineering has evolved, and continues to evolve, faster than most people ever imagined. While computing in the early 1970s handled only letters and numbers, technology now furnishes its users with an unprecedented volume and variety of data: from encyclopedia pages to clips of the latest music, from a spreadsheet to a real-time recording of a triple bypass. And access methods and requirements are evolving at the same pace.

    To keep pace with the demand for live, spatio-temporal, heterogeneous data, data engineering must let go of old paradigms. They have outlived their application. It is time to think out of the box: to consider what the operating environments of these new systems should look like. How can we build something that is experiential, not information-centric?

    Equally interesting is that user expectations of the data system have changed more rapidly than the data itself. To keep up with these changes, we must consider what the operating environments of future systems should be and how to realize those environments, rather than how to accommodate new functionality in our existing paradigm.

    In this paper, I look at the changing nature of applications by considering a few novel applications that use large volumes of data, and then discuss the functionality expected from these systems. That computing systems have evolved to follow user demand and application development is an important insight in this discussion: applications initially focused on data and computation, then information and communication, and now insight and experience. Most techniques in data engineering were developed to meet the needs of data systems of the last quarter of the 20th century. Data engineers must now address the needs of this century.

    2.0 Emerging Applications

    Some of us are old enough to remember the gentler times of database engineering. To define the requirements and structure of a database application, we merely looked at the corporate database. An entity, usually an employee, consisted of alphanumeric fields, each of which represented some attribute. Users posed a query to discover an employee attribute or to find all employees that satisfied certain attribute-related predicates.

    But although 2003 users have vastly different expectations, most databases still have the 1970s philosophy: users ask queries to get answers in an information-centric environment. This works well as long as all the users have the same requirements. The database is a resource for many people and provides a well-defined environment for articulating queries.

    Problems come when user requirements differ or when users don't know enough to ask the right questions. I see some of these problems already in applications that at first glance don't appear to be database applications. But that's why we need out-of-the-box thinking: to recognize that these applications are in fact the future of databases.

    2.1 Personalized Event Experience

    Suppose you are interested in cricket. A match lasts for five days and may not result in a decision, even after 30 hours of play. You may be an avid fan, but you don't want to spend five days watching a game video. You want to watch specific events in the game, how specific players perform, all scoring or all defensive highlights of a specific team, or any player comments on a specific play. You may also want to understand how the game evolved by seeing a fast animation of statistics related to different categories, or see a particular event from several camera perspectives, or listen to different commentators describe the same event.

    All these desires translate to query types that the game database must answer, and it must present the answer in a way that lets you enjoy the game. Current portals isolate information in silos so completely that users spend more time trying to navigate within and across silos than in enjoying the game. Also, users, not portals, should determine the events of interest.

    2.2 EventWeb

    Web search engines, for example, are notorious for their lack of discrimination. XML has not solved these problems because ultimately no search engine can anticipate a user's exact needs. The semantic Web is receiving a lot of press, and people are pinning many expectations on personal agents to help find the right information and services.

    I'm not convinced that this will solve the problem. The semantic Web still follows the legacy of Gutenberg. It is a web of pages that are predominantly prepared in a document mode. Again, this is fine if all you want are descriptions. But we can do so much more. Visualize instead a web of events, in which each node represents an event, past or present. Each event is not just someone's description of the event or some statistics related to it. It is the event itself, brought to you by one or more cameras, microphones, infrared sensors, or any other technology that lets you experience the event. For each event, all the information from sensors, documents, and other sources is united and presented to the user independently of the source. The user then experiences the preferred parts of a particular event in the preferred medium.

    In this vision, events are treated equally. The archived video of a news event is accessible in the same way as the CEO's Web cast or your son's first football game. The source can be anything from CNN to the local elementary school: whatever or whoever generates events worth archiving. And perhaps most important, because it is not text centered, the event Web will reach the 90% of humanity who either cannot grasp or cannot access current information and communication technology.

    I see the rudiments of this vision already. Stores are stocking Web cams in every shape and size at prices that even students can afford. Sensors that were once discrete are now being connected to form networks for various Internet applications, from a sushi bar in San Francisco to an ant colony in Lansing, Michigan. Multimedia phones with built-in cameras will be next. In short, we are witnessing the beginnings of the event Web explosion, just as decades ago we witnessed the document Web transformation.

    2.3 Scientific Applications and Data Warehouses

    Transforming data from disparate sources to a sensory medium that people can experience provides the opportunity for deeper insight into a problem or situation. This was the basis for the use of the oscilloscope and many similar instruments. Visualization techniques, a more modern oscilloscope, emerged as a powerful analysis and insight-generation environment. Now data for many applications, from customer transactions in a retail outlet to bioinformatics, is collected and stored in data warehouses.

    The data sources for these are diverse and the volumes very large. To use the data in data warehouses effectively, we need tools and techniques to transform disparate data into a form that will let people experience and gain insights into the situation. Current database technology was developed for applications that were challenging three decades ago. With the completely changed landscape of data, we require new technology to explore this data, in most cases to generate insights into an event. We cannot do this in a query-centric environment. The oscilloscope brought an experiential environment to science and technology in the last century. We require an oscilloscope to bring experience to computing in this century.

    2.4 Situation Monitoring and Analysis

    The increasing number of applications requires that we establish a large network of disparate data sources, including both sensors and human-entered data systems, to produce a data stream. The data from these sources must be interpreted and combined to provide an overall picture of the situation. This data is used to warn about potentially disastrous events, to provide the status of activities at different locations, and to analyze the causes of past events. In many cases, using a real past situation, different what-if analyses must be done to develop solutions for similar situations. In all such applications, real-time data analysis must combine with real-time data assimilation from all sensors to provide a unified model of the situation. Users should not see the situation as raw data from different sensors, but as the evolving big picture of the situation. Thus, in an emergency, users see not just isolated sensor streams from different locations, but a situation characterized as needing medical help, fire engines, or police. In all these applications, users are interested in the real-world situation from their own perspective, not in the data from a specific source. Sensor data is but one of several sources that form the model of the situation.

    3.0 Common Data Characteristics

    On the surface the requirements for experiential applications seem very different, but upon closer examination they have important similarities:

    - Spatio-temporal data is important.
    - Different data sources provide information to form the holistic picture.
    - Users are not concerned with the data source. They want to know the result of the data assimilation (the big picture of the event).
    - Real-time data processing is the only way to extract meaningful information.
    - Exploration, not querying, is the predominant mode of interaction, which makes context and state critical.
    - The user is interested in experience and information, independent of the medium and the source.

    Data sources are broadly of two types, precise and imprecise, and user requirements for the data fall into roughly two categories, insight and information. The matrix in Figure 1 captures the tensions between these four characteristics. In many situations, I know the data source precisely, even though it may be distributed. In other cases, I know only that what I need is available from somewhere. Likewise, in some situations, I am trying to gain insights into the behavior of a system, event, or concept, so my primary need is to explore and understand. In other situations, I need information and I want a specific answer.

    Predictably, databases are at the intersection of precise and information, the bottom left quadrant. Nothing beats them as a means of getting information from a precise source. In the top left quadrant are visualization environments and tools, promising ways to gain insight from a precise source. In the bottom right quadrant are search engines. Few people will dispute that search engines are an imprecise source. However, their intention is to provide information, not further exploration. Unfortunately, exploration does occur, but it is usually to find a suitable match for the query, not to explore an event further. Finally, the top right quadrant is the intersection of insight and imprecise source. This intersection produces what I call the experiential environment, a new way of presenting data that will become increasingly common in most data-intensive applications. This will then improve techniques in the other three quadrants.


    Figure 1. Data sources and access goals

    4.0 Experiential Environment

    An experiential environment is a collection of sensors and other data sources presented in a unified model that lets the user directly apply his senses to observe data and information of interest related to an event and to interact with the data according to his interests.

    Current database systems are essentially stateless, which lets them interact with multiple users in multiple contexts as efficiently as possible. The drawback is that people must adapt to the machine's way of doing things. In the early days of computing, people were in awe of the machine, so they were willing to rearrange their perspectives and activities to be part of its environment. Now they are far less reverent and demand that the machine make the adjustments. My Yahoo, My AOL, and other personalized Web pages reflect this "my way" shift. People expect their systems to remember what they like, where they went, what they need to do next, and where they like to shop. The system must remember how they got to a particular state, answer questions in the context of that state, and evolve to another state in a kind of symbiotic partnership with the user. E-commerce recognizes this shift, which is why Amazon suggests other books you might enjoy and displays books and other products that you most recently browsed.

    Ironically, this relationship brings out the best in both partners. Humans are efficient conceptual and perceptual analysts, but relatively poor at mathematical and logical evaluation; computers are exactly the opposite. Computers can perform mathematical and logical operations millions of times faster than any person, but their perceptual capabilities, even after all the progress of the last 40 years, remain relatively primitive.

    Yet current databases present sequential and logical information to humans and expect computers to detect complex patterns. The powerful synergy of human and machine is short-circuited. If we use computers and users synergistically, we can develop the experiential environment.

    4.1 Emphasis on natural interaction

    We are interaction-oriented creatures; it is how we learn about our environment. We perform an action, see its effect on the environment, and act in response to that. In the typical query system, however, we articulate a query, wait for the system to provide an answer, analyze the response to see if the system understood what we really wanted, and, more often than not, formulate a new query and start all over again. This process is painfully out of synch with our natural desire to learn through interaction. In experiential environments, users get data that they can easily and rapidly interpret using their natural senses. Once they interpret the data, they can interact with the data set either to get a modified data set or to perform certain actions. At any given time, the data set from the previous interaction is available, and the user interacts with the system on the basis of this holistic information.

    4.2 Similar query and presentation spaces

    Most current systems use different query and presentation spaces. A query environment helps users articulate their queries; the system computes the results of the query and presents them in a very different form. Search engines provide a box for the user to enter keywords, and the system responds with a list of thousands of entries spanning hundreds of pages. A user has no idea how the entries on the first page relate to the entries on the 113th page (if she gets that far), how many times the same entry appears, and often how entries on the same page can possibly have anything in common. Contrast this to video games, in which the player formulates a query by selecting some action on the screen and the system presents the result as some change on the same screen. Here the query and presentation spaces are the same, and the relationships among all relevant objects are clearly presented in a form that is obvious to the user.

    4.3 State and context

    People don't willingly change their physical and mental states abruptly. A gradual shift in context is much preferred, even in natural language, which is why we take pains to insert transitions like "on the other hand" to signal a contrast or "similarly" to signal a comparison. People simply operate better in known contexts because they can understand enough about the relationships among different objects in space and time to draw inferences about them or create models of them. We live in a world that is continuous in both space and time, so we are most comfortable organizing our knowledge of our surroundings in that manner.

    The space-time continuum is foreign to databases, and to information systems in general. How can a stateless system respect the demands of spatio-temporal data? Databases may be efficient, but being stateless has distinct disadvantages. Latency is a big one. Not only is the system slow, it doesn't even give feedback about its lack of progress. The perpetual hourglass, the bar that takes an agonizing amount of time to fill, or the endless flitting pages are the only indications that the system hasn't completely abandoned its task. Some Web sites try to reveal the number of bytes left, which is marginally useful as long as traffic allows. Nothing, however, will induce users to explore if it takes too long to move from place to place. When latency is low, on the other hand, exploration is much more pleasurable. Video games are an example: their appeal is due in part to their near-zero latency.

    4.4 Multimedia immersion

    Video games are also popular because they provide a powerful visual environment, and in some cases tactile inputs. Early computing environments were strongly text oriented because they had to be; the technology couldn't support any alternative form. There is no longer a reason to avoid powerful visual and audio presentations. Other senses may also become a familiar part of the computing environment. In some cases, like chemical industry or culinary applications, smell could make a presentation more compelling and immersive.

    I allude to video games often because they are a powerful example of a small-scale experiential environment. Anyone can use them, even children who cannot yet read or write. A video game can keep even these young users engaged for hours, a testimony to its natural interactivity.

    5.0 Assimilating Data into Unified Information and Knowledge

    In the experiential environment, data can be anything (audio, video, text, alphanumerics, infrared), depending on the sensors employed. Current databases and information systems were designed using data obtained and mediated by people, so predictably they ended up in alphanumeric form or in text. Database designers developed techniques to organize such data and deal with it effectively. New applications require not just the data repository, but complete environments for information and insight.

    This complete environment requires rethinking the way we index data. Current indexing techniques for different data types depend on metadata for that particular type. Metadata plays a key role in introducing semantics and is important in determining how data will be used. Schemas provide semantics in relational tables.

    XML has become very popular for introducing semantics in text. Ironically, XML came about because researchers were trying to develop automatic approaches to deduce semantics from the data. When it became clear that reaching this goal was far more complex than they had thought, researchers turned to a mark-up approach to semantics and threw in languages as well. I fail to understand the degree of excitement about XML. Clearly, it will solve some interesting problems, but it is not the panacea many people believe.

    XML's utility is limited to the introduction of semantics in strongly human-mediated environments. For sensory data like audio and video, feature-based techniques are much more useful. Here the goal is to identify some features in the data that will serve as a bridge between data and semantics. The idea of using clearly detectable attributes as features to infer semantics seems an obvious solution. For images, commonly used features are color histograms and simple measures of texture and structure. Most techniques measure global images, not objects within images, yet people are most often interested in objects. How to get to those objects is a problem that most data engineers are ignoring. The same is true for most other signals in medical, seismic, and other applications. Signals are usually indexed using features that capture global or semi-global properties, while semantics usually requires the structure of local features.

    Unfortunately, these techniques are still immature, primarily because researchers are interested in developing general-purpose techniques rather than restricting their systems to a specific context. Researchers can learn from the success of natural language and speech recognition systems: all successful systems work in a specific context.
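
    To make the feature-based idea concrete, here is a minimal sketch (in Python) of indexing images by a global color histogram, the kind of feature mentioned above. The binning, the L1 distance, and the array-based image format are illustrative assumptions, not a description of any particular system.

        import numpy as np

        def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
            """Global color feature: a normalized joint histogram over R, G, B."""
            pixels = image.reshape(-1, 3)            # image is H x W x 3, uint8
            hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                                     range=[(0, 256)] * 3)
            return hist.ravel() / pixels.shape[0]    # normalize to sum to 1

        def nearest(query: np.ndarray, index: dict) -> str:
            """Return the indexed image whose histogram is closest in L1 distance."""
            return min(index, key=lambda name: np.abs(index[name] - query).sum())

        # Toy index of two random "images", then a lookup.
        rng = np.random.default_rng(0)
        def toy_image():
            return rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
        index = {name: color_histogram(toy_image())
                 for name in ("sunset.jpg", "forest.jpg")}
        print(nearest(color_histogram(toy_image()), index))

    A global histogram like this describes the whole image, which is exactly the limitation noted above: it says nothing about the objects inside the image.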

    5.1 Breaking down silos

    Everyone gets information about objects and events from different sources in different data types. What I know about the war against terrorism is based on what I saw on TV, read in newspapers and magazines, heard on the radio, and discussed with my friends. That is, my perception is based on information I've assimilated from multiple data sources. Somehow, and quite naturally, I've assimilated all this information and represented it in a unified form that is independent of the individual data sources.

    Information systems, in contrast, create data silos. The metadata is defined and introduced for data of a particular type, which is indexed and neatly stashed in its own place. A video collection cannot interact with a text collection to produce a video-text collection. Indeed, the silo structure is strongly defined, with little or no interaction among silos, as Figure 2 shows.

    Figure 2. Different data sources have different indexing mechanisms, but these sources live in their own silos.

    The challenge to the database community, then, is to break down these silos to unify information. This requires more out-of-the-box thinking, because most data sources are designed to behave like independent silos. Their creators assume that after the integration system analyzes the silos and extracts their metadata, it will somehow combine the metadata to provide correct results. Indeed, many current research efforts are aimed at this kind of solution.

    Researchers also form strong silos. I know from experience in many research areas, including image and video database research, that tunnel vision is common. Just as the six blind men had vastly different ideas about the size, shape, and function of an elephant, so the database, computer vision, and information retrieval communities have diverse (and equally stubborn) views of an image database. Having all these people develop systems without communicating is no more productive than having five students in separate rooms attempt to produce a coherent thesis. Perhaps the challenge is to break down both kinds of silos.

    5.2 Information Assimilation

    Many system engineers, particularly designers of control and communication systems, use a strong, domain-based information-assimilation approach, which estimates system parameters from many disjoint and disparate information sources. The parameters of the mathematical system model are successively refined by observing the data as it becomes available. In this approach, each data source is just one more source that contributes to the model's refinement, and the goal is to get the most precise model possible. At some point, it is possible to completely ignore data from a specific source. Thus, a data source is just that, a data source, and the model represents the current knowledge about the system, knowledge that in turn is based on evidence from all the data sources. Conceptually this approach is very different from current information integration, in which the system analyzes a particular data source and then combines its results with those of other data sources.

    A very important result of the assimilation approach is that the system can efficiently deal with real-time data by keeping only what is important for the goal of the system. Most applications collect real-time data, and it is very important to recognize that all data is not equal and that importance is context dependent. Data-engineering systems will have to learn to ignore data.

    This approach also allows a very smooth and effective introduction of semantics into the process. Here the semantics are brought in by modeling the data and information flow in the system, representing states and state transitions, and the roles of different objects in different states. Thus, this modeling process helps in representing data as well as in analyzing the data as it arrives, extracting only meaningful information.
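
    As a concrete, drastically simplified illustration of this style of successive refinement, the sketch below maintains a single scalar model parameter and folds each new observation in with a precision-weighted update, ignoring observations whose contribution would be negligible. The scalar state, the Gaussian-noise framing, and the gain threshold are assumptions for illustration, not details of the approach described above.

        class AssimilatingModel:
            """One scalar parameter, successively refined from disparate sources."""

            def __init__(self, prior_mean: float, prior_var: float):
                self.mean = prior_mean   # current best estimate
                self.var = prior_var     # uncertainty of that estimate

            def assimilate(self, value: float, source_var: float) -> None:
                # Precision-weighted update: each source is just one more source.
                gain = self.var / (self.var + source_var)
                self.mean += gain * (value - self.mean)
                self.var *= 1.0 - gain

            def assimilate_if_useful(self, value: float, source_var: float,
                                     min_gain: float = 0.01) -> None:
                # Learn to ignore data: skip sources that barely refine the model.
                if self.var / (self.var + source_var) >= min_gain:
                    self.assimilate(value, source_var)

        model = AssimilatingModel(prior_mean=0.0, prior_var=100.0)
        model.assimilate(4.2, source_var=1.0)            # a precise sensor
        model.assimilate(3.0, source_var=25.0)           # a noisy human report
        model.assimilate_if_useful(9.9, source_var=1e6)  # effectively ignored

    The point of the last call is the one made above: a data source is just a data source, and when its evidence no longer refines the model, the system can simply drop it.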

    5.3 Event Graphs for Unified Indexing

    An approach to breaking down data silos, which my colleagues at PRAJA and at UC San Diego's Electrical and Computer Engineering Department and I developed, is to build a unifying indexing system that introduces a layer on top of the metadata layer of each data silo, or disparate data source. The layer uses an event-based domain model and metadata to construct a new index that is independent of data type. We decided to use the event, a significant occurrence or happening located at a single point in space and time, as the basic organizational entity for the unifying index because it has many applications and theories in human memory organization. An application domain can be modeled in terms of events and objects. Events are hierarchical and have all the desirable characteristics that have made objects so popular in software development. In fact, events could be considered objects whose primary attributes are time and space.

    In our approach, an event graph parses the data as it arrives and assimilates it to build an environment model that reflects knowledge about the event on the basis of the information collected so far. As Figure 3 shows, event graphs essentially create a list of spatio-temporal events as they take place. This becomes the database that describes domain semantics and links these events to individual data streams. Users can study as many or as few of these as they want, or they can visit the entire stream to experience the full event. Event graphs also capture the entities and their roles in the event, the event's location and time, and event-transition information. They capture causality in an event-transition mechanism.

    An event base stores the name and nature of the event and all other relevant information. The relevant information may not be available at the time the event is created; when it becomes available, the system attaches it to the event. For example, comments in the local newspaper about the CEO's talk can be linked to the talk when they become available. Thus, the event base is an organic database that keeps growing as a result of the many different processes running, and it differs from current databases in this respect.

    The event base also stores links to original data sources, which means the system can present the appropriate media in the context of a particular event. Thus, when the cricket fan accesses the match to see all of Virender Sehwag's boundaries in the first innings, the system can suppress all other shots and show only what the user wants.

    The user interacts with the event base directly, and the event base uses original data sources as required. This has several important implications: the system can preprocess important information related to events and objects according to its domain knowledge; it can present information using domain-based visualization; and it can provide unified access to all information related to an event, independent of the time the data became available or was entered in an associated database. Because of these characteristics, unified indexing is the backbone of an experiential environment.
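
    A minimal sketch of such an event base follows. The class and field names are illustrative choices, not the PRAJA system's; the sketch captures only the ideas above: events carry what, where, and when attributes, link back to their original data sources, can acquire information after creation, and can be queried independently of data type.

        from dataclasses import dataclass, field

        @dataclass
        class Event:
            name: str                 # e.g., "CEO talk", "boundary by Sehwag"
            event_class: str          # class in the application's ontology
            time: float               # when (seconds since some epoch)
            location: tuple           # where (latitude, longitude)
            sources: list = field(default_factory=list)     # links to raw media
            sub_events: list = field(default_factory=list)  # events are hierarchical

        class EventBase:
            """An organic store: events keep acquiring information after creation."""

            def __init__(self):
                self.events = []

            def add(self, event: Event) -> None:
                self.events.append(event)

            def attach_source(self, name: str, source: str) -> None:
                # Late-arriving data (a newspaper comment, say) is linked to its event.
                for e in self.events:
                    if e.name == name:
                        e.sources.append(source)

            def find(self, event_class: str, region: tuple, interval: tuple) -> list:
                # Unified, type-independent index: query by what, where, and when.
                (lat0, lat1), (lon0, lon1) = region
                t0, t1 = interval
                return [e for e in self.events
                        if e.event_class == event_class
                        and t0 <= e.time <= t1
                        and lat0 <= e.location[0] <= lat1
                        and lon0 <= e.location[1] <= lon1]

        eb = EventBase()
        eb.add(Event("CEO talk", "talk", time=1.0e9, location=(33.7, -84.4),
                     sources=["webcast.mpg"]))
        eb.attach_source("CEO talk", "newspaper-comment.txt")  # arrives later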


    Figure 3. Event graphs unify different data sources by providing a semantic indexing and linking approach.

    6.0 WYSIWYG Search

    As Figure 4 shows, an event has three dimensions: what it is (its name and class), where it took place, and when it took place. A user navigating through an event base is interested in finding all events in a certain class that occurred at a particular location and time. The event name captures the event's purpose and identity; the event class is organized in an ontology defined for the application. The What part of the screen (top left) presents a list of all events modeled in the system. Location can be specified in some kind of map: geographic, schematic, or conceptual. Time is organized as a timeline.


    Figure 4. An approach to showing events and creating a WYSIWYG search environment. Clear What-When-Where areas provide a multidimensional WYSIWYG search environment.

    Users can select one or two event classes or navigate through class ontology hierarchies endlessly; there is no theoretical limit on the subclass structure. The depth of the hierarchy depends on the model used in the application and the data available. Selecting a class automatically selects all its subclasses (see the sketch below). To navigate through event location and time, users either zoom or move in different directions, similar to the way video game players select parts of a map, from a room to an entire world. The event timeline could be anything from microseconds to centuries.

    At all points of the search the user experiences What-You-See-Is-What-You-Get (WYSIWYG) characteristics. Once a user selects event classes, part of the map, and a time, the system presents all events and their selected attributes in all three places. In the figure, the user has selected the inventory class for SBU accessories worldwide in 2001, which is akin to the text query, "Show me the inventory status of all the SBU accessories worldwide in 2001." The event list (bottom half of the screen) shows the details of the inventories. The colored dots on the map show the location and status of the inventory: needs immediate attention, needs some attention, okay. To avoid confusion, this example does not show a color-coded list and timeline, but if the user selects an item in the list, the display will change color to highlight that selection and its corresponding symbols in the time and location areas. The exact mix of color and symbols depends on the application.
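
    Building on the event-base sketch in section 5.3, the following illustrative code shows the two mechanics just described: selecting a class selects its entire subtree in the ontology, and a single What-Where-When selection drives the list, map, and timeline views from the same result set. The ontology dictionary and the view structure are assumptions for illustration only.

        def expand_class(ontology: dict, root: str) -> set:
            """Selecting a class automatically selects all its subclasses."""
            selected = {root}
            for child in ontology.get(root, []):
                selected |= expand_class(ontology, child)
            return selected

        def wysiwyg_query(event_base, ontology, root_class, region, interval):
            # One selection feeds all three windows, so the query and
            # presentation spaces stay equal.
            classes = expand_class(ontology, root_class)
            hits = [e for c in classes
                    for e in event_base.find(c, region, interval)]
            return {"list": hits,
                    "map": [e.location for e in hits],
                    "timeline": sorted(e.time for e in hits)}

        ontology = {"inventory": ["sbu-accessories"], "sbu-accessories": []}
        views = wysiwyg_query(eb, ontology, "inventory",
                              region=((-90, 90), (-180, 180)),   # worldwide
                              interval=(978307200, 1009843200))  # year 2001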

    By displaying events on a map as well as on a timeline, the WYSIWYG approach maintains event context. The user can then refine the search through any window, say by zooming into and out of the map or timeline or by moving left or right. A change in one part automatically updates the results in the other windows. Consequently, the query and presentation spaces remain equal. Also, as users change the search criteria, they get immediate results with minimal latency; in most applications, the results can be instantaneous. Users can experiment with the data set on their own terms and develop insights at their own pace, always within the event's context. The system displays results continuously, making it easier to hypothesize about data relationships. It will be possible to test a hypothesis by linking such a system to data-mining tools that would let the user explore large data warehouses.

    If a user wants to know more about an event, he can explore it by double-clicking in any of the three windows (what, when, or where). The system then provides all the data sources (audio, video, or text) and any other event characteristics packaged in an event envelope, which the system generates automatically from the information assimilated in the event base as it is created. The user can launch a variety of sources from the envelope, and they will open in the desired mode, either in a different window than the user originally selected or in the same window. The system accesses and appropriately presents much of the information in the event envelope through links to original sources, such as programs launched to present the results of a particular dataset or a simulation.

    An event envelope is a powerful mechanism that unifies the results of many complex operations. If selected variables have dynamic attributes, the event envelope can present historical attributes for those variables. Users can then save an event envelope as a snapshot, the particular state of an event, and compare it with other snapshots representing later states. The snapshot button (top right in Figure 4) lists all the event envelopes the user has saved. An event envelope can be sent through e-mail and hence can help build communities around specific themes. Amateur astronomers, for example, are interested in scanning the sky for near-Earth objects like comets. Clubs could exchange event envelopes and commentary about the images in the same e-mails. Moreover, the envelopes would contain links to details like magnitude and angular distance from a known star. This kind of rich communication increases both individual and community knowledge.

    7.0 Applications

    Here we briefly describe three applications to give an idea of what could be done in this environment.

    7.1 Football

    Figure 5: Event graph of a football game

    The graph in Figure 5 shows the events in a football game. The text shows several levels of event hierarchies, from the complete game down to a particular drive. For simplicity, the figure does not show levels below a drive. The graph also shows the potential transitions from one event to another in the game in terms of downs. Thus, the graph represents a subset of an event-transition diagram for the game.

    [Figure 5 content: the hierarchy Game > Quarter > Team A/B Drive; within a drive, transitions among 1st through 4th downs, ending in a field goal (FG), touchdown (TD), or turnover (TO).]


    The model is generic for the game; the sequence of events generated depends on the particular game. Figure 5 shows only a small subset of the event model, to give a flavor of the application.

    Our data sources for the game included video (plus audio) from multiple cameras; play-by-play information, which various companies generated and made available as a data stream; and a player and statistics database. A rule-based system decided whether a particular play (an event) would be of interest to anyone, and thus whether or not to save the related video.
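
    The actual rule base is not published here, but a sketch of the idea might look like the following; the specific rules and field names are hypothetical.

        def is_interesting(play: dict) -> bool:
            """Decide whether a play's video is worth saving."""
            rules = [
                lambda p: p.get("scoring", False),         # touchdowns, field goals
                lambda p: p.get("turnover", False),        # interceptions, fumbles
                lambda p: p.get("yards_gained", 0) >= 20,  # big gains
                lambda p: p.get("down") == 4,              # fourth-down attempts
            ]
            return any(rule(play) for rule in rules)

        # Applied while parsing the play-by-play stream:
        plays = [{"down": 1, "yards_gained": 3},
                 {"down": 3, "yards_gained": 42, "scoring": True}]
        kept = [p for p in plays if is_interesting(p)]     # keeps only the second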

    The system parsed the play-by-play data stream, applied the rule base, and prepared an event base for the game. As Figure 6 shows, the event base appeared to the user as a time machine: users could go to any moment and see all the related statistics, including the score, timeouts left, rushing yards, and first downs for each team at that particular time.

    This display is like the one discussed above, but designed using the domain information for football and hence familiar to football fans. By clicking on the timeline, a user can select any time instant and see the state of the game at that time. By moving the pointer along the timeline, users can see how the game evolved. They could filter events of their choice and see them in a standard football representation, the football field at the bottom of the screen. By double-clicking on a play, they could get more information about the play or watch a video of it through the event viewer. As they watched scoring plays from various angles, they could click on a player to get more information.

    Twenty-five college football teams used this system. Fans who could not watch a game on national TV, either because the game was not televised or because they were in the wrong place, could still enjoy their team's games in a compelling way using this system. They could watch the same play by their favorite player from multiple angles to gain insight into what really happened.


    Figure 6: Experiential environment for football fans to enjoy a multimedia presentation in a time-machine format.

    Insight building via experience and exploration is not limited to entertainment applications. In a modern enterprise, line managers would like to identify, in real time, potential problem areas, how they arose, and how they can change things. The focus is on performance indicators, which measure the discrepancies between planned and actual performance, and on the relationships among performance indicators and available infrastructure, environmental factors, promotional efforts, and so on. It is not enough to discover a problem; companies want to analyze why the problem arose so that they can improve their processes. In this context, the why includes activities related to the problem as well as its historical perspective.

    [Figure 7 content: MonthlySalesActivity and YearlySalesActivity nodes, each carrying target, actual, and discrepancy attributes for sales amount and sales calls, rolled up from daily and quarterly figures; alongside a What taxonomy of Sales (overall, by customer, by product category down to individual products), Inventory, and Marketing.]

    Figure 7: Event graph and taxonomy of a Demand Activity Monitoring application.

    Figure 7 shows an event graph for a sales forecast and inventory monitoring system designed to monitor an automotive parts manufacturer's key activities. These include sales (monthly, daily, and hourly forecast target and actual for different sales regions) and inventory (monthly, daily, and hourly available inventory for different warehouses). Activities are rolled up temporally (hourly to daily to monthly) and by various actors (customers, parts, parts line, and so on). Figure 8 shows a screenshot of the EventViewer for this application. Performance indicators for each activity are mapped to red, yellow, and green based on domain-specific criteria. The display in Figure 8 is a close-up version of the display in Figure 4 shown earlier. It is easy to see that one can select different geographic regions and different parts to understand what was going on in that part of the world. These displays provide a holistic picture of the situation to an analyst, who can then drill deeper into it. The system provides tools for doing so, but we do not discuss them here due to space limitations.
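
    The two mechanisms this application leans on, temporal rollup and the traffic-light mapping of performance indicators, are easy to sketch. The 5% and 15% thresholds below are hypothetical stand-ins for the domain-specific criteria mentioned above.

        from collections import defaultdict

        def roll_up(hourly: dict, hours_per_bucket: int = 24) -> dict:
            """Aggregate hourly figures into coarser buckets (hourly to daily)."""
            buckets = defaultdict(float)
            for hour, value in hourly.items():
                buckets[hour // hours_per_bucket] += value
            return dict(buckets)

        def indicator_color(target: float, actual: float) -> str:
            """Map the planned-versus-actual discrepancy to a status color."""
            shortfall = (target - actual) / target
            if shortfall <= 0.05:
                return "green"    # on plan, or within 5% of it
            if shortfall <= 0.15:
                return "yellow"   # needs some attention
            return "red"          # needs immediate attention

        daily = roll_up({0: 10.0, 1: 12.0, 25: 9.0})        # hours 0-23 -> day 0
        print(indicator_color(target=100.0, actual=82.0))   # "red"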

    Figure 8: Another display of the inventory application. Compare this display to the one in Figure 4 to see how the system can be used in WYSIWYG mode.

    8.0 Conclusion

    Rapid advances in many related areas have brought some interesting challenges to the data engineering community. Traditional database techniques need to be reconsidered and readapted for the new applications. Relational approaches are powerful and will still be useful as a back end. But the front end of these systems requires data engineering that is very different from what we have done so far. The challenge is to take a more solution-oriented perspective or be boxed into back-end repository management.

    Some new attributes of data emerge as dominant issues: semantics, multimedia, liveness, location sensitivity, and separate streams of sensor and other data. To unify all sources of information, events appear to offer a powerful approach for modeling, managing, and presenting data. I believe that event-based experiential environments will be useful in many emerging applications. The thoughts and ideas I have presented are still in the conceptual stage. We have a long way to go in refining this approach to make it practical, but it is clear we must take a new path, one that is outside conventional thinking, if we are to keep pace with and enable these new applications.
