

Data Architecture, Data Warehousing, and Master Data Management

by Ken Orr, Fellow, Cutter Business Technology Council

In large organizations everywhere, there is increasing interest in data architecture and associated issues such as data warehousing strategies and master data management. This Executive Report addresses the nature of data architecture, how “operational data” and “informational data” differ, and where data warehousing fits in. The report asserts that over the last 30-40 years, there have been a number of profound developments in data architecture: the emergence of database management systems; the emergence of relational database systems; and the emergence of data warehousing. Additionally, we discuss how master data management systems and data hubs are becoming mainstream ways of linking vital information.

Business Intelligence Advisory Service, Vol. 7, No. 1


Print, film, magnetic, and optical storage media produced about five exabytes¹ of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks. [6]

Data architecture is one of the most important things that any IT organization can do, because data is one of the principal things people want from IT, and ultimately, data is what top management is paying for. Unfortunately, for a variety of reasons, IT has never quite lived up to its commitment to provide the right information to the right people at the right time. Survey after survey shows that top managers are sure their organizations have the data they want somewhere in the organization, but they don’t know how to locate, manipulate, and extract it.

Over the last four decades, this data access problem has been attacked multiple times. In the 1960s and early 1970s, the answer was database management systems (DBMSs). Then in the late 1970s and early 1980s, it was 4GLs, spreadsheets, and executive IS. In the 1990s, it was multidimensional databases, business intelligence (BI), and data warehousing. Each of these approaches has solved part of the overall information access problem, but getting at the right information in near real time is still tough. Even in old-line companies, closing the books at the end of a month or the end of a quarter can take too much time.

For a long period, nonoperational IS was referred to as “end-user computing.” This term was occasioned by the notion that the operational folks who enter the raw transaction information were “information providers” or “information stewards,” while those who used the data for nonoperational functions were “information users.” Historically, end-user computing was built one user community (marketing, product development, R&D) at a time, and over time, this created an enormous number of problems.


¹ How big is five exabytes? If digitized with full formatting, the 17 million books in the Library of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress’s book collection.


The biggest problem with the bifurcation of operational and informational data is the difficulty in getting what top management often refers to as a “single view of the truth.” In most cases, large organizations have many different views of the truth. When managers obliquely refer to data quality problems, what they’re saying is that they don’t know which view of the truth to rely on (e.g., the one from the sales department or the one put together by the folks in accounting). The demand for more corporate transparency brought on by Sarbanes-Oxley (SOX) and Basel II has raised the stakes for improving data quality. Because top managers are required to ensure that the public view of the truth (i.e., public financial reports) is correct, they are now willing to invest more in their underlying information infrastructure. The problem is that if you don’t know which version of the truth corresponds with the real world (and it often turns out that none of the various truths are correct), then it is hard to swear by any set of numbers.

Moreover, the Internet and electronic communication have vastly complicated our already vexing data problems. In addition to facilitating access to the structured data that exists in enterprise databases, organizations everywhere now handle increasingly large amounts of unstructured data, including e-mail, attachments, voice mail, multimedia, Web content, and so on. What started as a trickle has turned into a flood. Today, as organizations become truly paperless, the information that used to be on paper is now in electronic form, and something must be done with it. Most formal communication these days is via e-mail and attachments, so much so that the unstructured data in an average organization is growing much faster than its structured data, with fewer good processes and tools to manage it.

To make matters worse, structured business data (transactions, product updates, customer updates) is managed by one group (database administration) while unstructured data is often managed by another (network administration). Clearly, this situation requires new approaches and new governance.

Data architecture is one new approach for solving large enterprise data problems and issues. Data architects are concerned with all the data in the organization and how the various pieces fit together. Data architecture is concerned not just with the massive redundancy that exists in most organizations but, more importantly, with the massive inconsistency brought on by that redundancy.

Although data architecture is a new box on the organizational chart, it is not a new function; indeed, various groups and individuals in our organizations have been trying to manage data for decades under difficult circumstances. But data architecture has become such a critical issue that it is now a top management concern, and as a consequence, the function is being pushed toward the top of the org chart. Data is the lifeblood of the electronic real-time enterprise, and data management is the name of the game.

As organizations become more serious about managing their data — all of their data — new approaches and technologies get invented, tested, and deployed. Tools for accessing, integrating, and rationalizing data from lots of disparate sources are perfected.

There tends to be a circular pattern in systems development in which developers, working for different users, develop a series of systems. Then, over time, the smart ones notice that certain major data components are constantly reappearing. For example, in manufacturing applications, there is always a need for “customer data” and “product data,” and in HR applications there is a need for “employee data” and “job data.” In purchasing, there is always a need for “vendor data” and “vendor product (asset) data.” This discovery was first made in the early days of the mainframe and ultimately resulted (in the 1960s) in the development of the first serious DBMSs (e.g., GE’s Integrated Data Store, CINCOM’s Total, and IBM’s Information Management System [IMS]). These databases were built around the idea that, while different systems often did different things, most systems required access to a common set of core data and that it didn’t make sense to have dozens of “customer files” or dozens of “product files.”

The first systems that utilized common database software were dramatic improvements over previous applications. Instead of creating new definitions for “customer,” “product,” or “order,” it was possible to reuse definitions in developing new applications. Better yet, DBMSs allowed developers to use data that already existed and, if certain data elements were missing, to simply add what was new (attributes, tables, etc.) to create subsequent applications. At least that was the plan.

But in reality, no matter how sensible the idea of common databases with common data element definitions (metadata), the idea created a new class of project management problems. Adding a second application to a planned central/corporate database that was already in operation wasn’t too difficult. If a field or file was required in order to satisfy the second application, then so be it. Fields, and sometimes completely new data files, were added or modified to meet these new demands. But as more and more applications were added, there was a need for lots of traffic cops to keep all the changes from screwing up one another and, most importantly, to keep from screwing up all the applications already in operation.

So what happened was a second cycle of fragmentation. Project managers began to clone existing databases rather than integrate their database with all the others. This meant that the new application could be developed, tested, and installed without worrying about knocking over the existing house of cards. In practice, it made the project manager’s job easier, but in the end, it made many other things worse.

Instead of a single set of customer data, there were once again multiple sets of customer data using, at least initially, mostly the same data definitions (e.g., “customer.no,” “customer.name”). Once again, there were multiple copies of the same data, and instead of sharing common databases, data was transferred from one application to another largely through interface (data exchange) files. The more common the data, the more copies. The more copies, the more views of the truth.

By the end of this second cycle of fragmentation, which occurred toward the end of the 1980s, a catalog of common data problems existed in nearly every large group of computer users in the developed world. To solve these problems, a new approach was invented. The result was the move to a new data architecture for informational data: data warehousing.

In general, data warehousing projects cost more and took longer than most people expected. But in most organizations, real headway took place, and today, most large organizations have made major strides with their data warehousing initiatives. But there is no resting on past laurels; competitive pressures require even better data, especially data about our major products and services and about our major customers, vendors, etc. The new buzzword for projects that build customer and product hubs is “master data management” (MDM).

In a sense, MDM is a new name for an old concept. Since the earliest days of mainframe computers, systems developers and database designers have been stumbling over the same basic truth: namely, that the most important data components correspond to the most important business entities. This discovery provided the spark that has triggered some of the most important developments in the history of software.

Now, after nearly 20 years of data warehousing — though significant improvements have been made — a third cycle of fragmentation has occurred within the industry. While some progressive organizations have built unified data warehousing solutions, a larger number have opted to develop quasi-independent data marts under the cover of creating a data warehousing architecture.

In this Executive Report, we first explore the fundamental concepts behind data architecture. We then examine some of the challenges that data architecture faces and some of the drivers and enablers that are pushing it forward. Finally, we look at the solutions enterprises are utilizing to help better manage their data.

FUNDAMENTAL IDEAS BEHIND DATA ARCHITECTURE

Before we begin discussing the basic concepts, I think it is important to get some things straight. The first thing to note is that architecture is not just drawing bigger data models. Data architecture is about all the data and information that the organization keeps. It is mostly about the care and feeding of data/information kept in electronic form, but it includes the data in whatever form is important to the organization.

Data architecture is not just records management for electronic records, either. There is something fundamentally different about data architecture in this era of computers and communication. Today, someone can easily make off with the entire customer master information on a small laptop. Information that once filled floor upon floor of file cabinets currently occupies only a couple of disk drives. We have at our disposal more computing power than anyone could have imagined a few decades ago, but we also live in an age where a knowledgeable person armed with a desire to do harm to our organization could wreak havoc.

Data architecture in most organizations is a new discipline. It involves thinking broader and longer than we have in the past. It involves worrying more about the future, raising inconvenient questions, and taking risks. Throughout the history of IS, thinking about data has produced enormous benefits. No single tool has proved as valuable over time as DBMSs, but we are now in a period in which unstructured data is rapidly overwhelming our ability to understand, model, and manage it, and our tools have not yet caught up. But more important than tools, we need people with broad interests and experience who can think seriously about the future.

Elsewhere I have written that databases are not about tables, attributes, or relationships; rather, they are about the real world. A database, by its very essence, is a model of the real world, and therefore, the quality of a database is, at its heart, a function of how well our electronic models mirror what’s going on outside.

Over the years, it has occurred to me repeatedly that the key to data architecture is knowing what is important in our model of the real world. A major systems application, for example, may include two tables or 300 tables, but only about 10% of those tables are significant; the rest are typically reference tables that we use to make sure our data is correct and consistent. But it is that 10% that represents our core strategic data and therefore our data architecture. Ultimately, coming up with a real-world data architecture comes down to semantics — business semantics.

Business Semantics

I wrote an Executive Report entitled “Business Semantics,”² which dealt with what is perhaps the most important thing I’ve learned in 30-plus years of software development — the critical role of semantics and ontology in IS and data architecture.


² See my Business Intelligence Executive Report “Business Semantics” (Vol. 5, No. 7).


Semantics is the study of meaning, and ontology is the study of what exists in the real world. While terms like semantics and ontology may seem overly philosophical to today’s businesspeople and software developers, these subjects are as relevant now as they were 2,000 years ago when the Greeks first proposed them.³ Data architecture, databases, data warehouses, and MDM are all about the real world and the things (and classes of things) in the real world that affect our businesses. Enterprises keep data about products and customers, because products are what the enterprise sells, and customers are the real-world entities they sell them to. The same is true with purchased products, vendors, jobs, services, and employees.

Too many people think of data design as largely a technical or formal activity, which it is partially, but first of all it is about the things in the real world — the actors, transactions, events, locations, roles, organizational structures, and relations.

Actors and Messages

In my work on systems and data design over the last 30 years, one thing has increasingly stood out: all of the major characteristics of business IS are ultimately related to a small class of categories (classes of entities) that have to do with the business context for a given enterprise. One would like to suggest that this insight was rationally worked out top-down from first principles, but like most really important insights, this was the result of discovering a tool for doing requirements that focused on context. In the late 1970s, my colleagues, clients, and I began using something we now call context diagrams as the starting point for doing requirements definition (see Figure 1).

We were looking to find some way to get groups of systems users to help us quickly understand a given systems domain. What we came up with was something we initially called an entity diagram (what we now call a context diagram). On these diagrams, there were only two classes (categories) of entities: (1) bubbles, which represented “actors,” and (2) arrows, which represented “messages.” Context diagrams are stylized communication diagrams. Actors communicate by sending and receiving messages:

Actors are individuals, organizations, or systems. The distinguishing characteristic of actors is that they can autonomously send and receive messages.⁴ In this environment, actors “act.” They do something; more specifically, they communicate. As we will see later in this report, key actors are often the folks who are most important to our enterprise (e.g., customers, vendors, employees, and shareholders).

Figure 1 — Business context diagram. [The diagram shows actors (Customer, Order entry, Sales manager, Credit manager, Warehouse, Accounting) connected by message arrows (order, order entered, order approved, shipping notice, equipment delivered, billing notice, invoice, payment).]

³ The earliest philosophers were concerned with the relationship between our thoughts, language, and the real world, but the first organized ontology was propounded by Aristotle in the 4th century BC. In a series of seven books — historically called the Organon (works) — Aristotle set forth the first definitive ontology and the first organized rules of logical and scientific reasoning. In the Organon, Aristotle defined three things that are now recognized to be part of any well-formed ontology: a set of categories (classes of entities), rules for forming logical sentences (propositions) using the categories, and a set of inference rules for reasoning from a given set of propositions.

⁴ Actors are very similar to agents in agent technology.


Messages are what is communicated. Messages can be in any physical form: documents, packages, letters, e-mail, voice, or multimedia. In an IS context, key messages are often the business transactions that make up the system (e.g., orders, shipments, bills, payments, and returns).

On the surface, context diagrams are elegantly simple, but they are also amazingly effective. As far as we can tell, they have been used for decades and have been reinvented dozens of times. Context diagrams only show actors and messages, but we have found that they are also excellent guides to uncovering the other semantically important entities.
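Captured as a minimal sketch — not from the report; the class names and sample data are illustrative, loosely echoing Figure 1 — a context diagram reduces to just these two categories:

```python
# A minimal sketch of a context diagram as data: two categories only,
# actors (autonomous senders/receivers) and messages (what is communicated).
from dataclasses import dataclass

@dataclass(frozen=True)
class Actor:
    name: str            # an individual, organization, or system

@dataclass(frozen=True)
class Message:
    name: str            # e.g., a business transaction
    sender: Actor        # actors "act": they send...
    receiver: Actor      # ...and receive messages

customer = Actor("Customer")
order_entry = Actor("Order entry")
accounting = Actor("Accounting")

diagram = [
    Message("order", customer, order_entry),
    Message("invoice", accounting, customer),
    Message("payment", customer, accounting),
]

# The diagram doubles as a guide to the other semantic categories: every
# message implies an event (its sending) and an object (what it is about).
for m in diagram:
    print(f"{m.sender.name} --{m.name}--> {m.receiver.name}")
```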

Objects and Subjects

Any context diagram is about something; that something is what we call the object or subject. The objects involved in business semantics are the objects of the communication that goes on between the major actors. Consider the case where our business (the enterprise) does business with a customer. Here, the object is normally some kind of product or service. In modern terms, for legal business exchange to take place, a number of messages have to go back and forth (e.g., order and shipment). What is ultimately being exchanged are the objects of the communication — in this case, product and money.

Objects can be something other than products or services. In the case of buying or selling real estate, the objects are parcels of land and improvements (e.g., houses and barns). In the case of employment, the object is a job or position. And in the case of intellectual property, music, or publishing, the objects are likely to be patents, copyrights, or trademarks.

Finally, objects can be people. When that is the case, we refer to them as “subjects.” A few years ago, I worked on the requirements for a juvenile justice system, in which the juveniles were both an actor in the system and the object of the system at the same time. In corrections, welfare, medical, and any number of case systems, the object of the communication (and therefore the system) is the subjects that the agency deals with or is responsible for.

Events

In business semantics, the sending of a message is triggered by an event, and the receiving of a message is signaled by yet another event. Business events tend to fall into two subclasses: periodic and aperiodic. Common periodic events in the financial world are triggered by the end of month, end of quarter, or end of year. Aperiodic events might be a car accident, death, birth, customer order, or any other unpredictable happening.

Locations

It is important to know where an actor or object is located or where a message should be sent; locations define physical or logical addresses. Physical locations are often referred to as addresses or coordinates. Historically, physical locations have been thought of as relatively static things, but with the advent of technologies like GPS and wireless communications, more and more things (actors, objects, subjects) will have transient addresses that tell you where they are right now (or at least where they have been recently). Logical locations are things like e-mail addresses, Web addresses, or telephone numbers. Historically, things like telephone numbers indicated physical areas, but with cell phones, modern switching systems, and VoIP, unless you’re the service provider, you have no way of knowing where somebody actually is.

Context

Part of what natural language does well and computer languages do poorly is take into account the “context” we talked about previously. Because there are more things in the world than there are words, human language frequently uses the same word or phrase to mean different things in different contexts or uses different words to refer to the same thing (or person) in different contexts. Identifying context turns out to be really important and difficult, since context can be both subtle and tricky. Ludwig Wittgenstein spent decades trying to tease out a common definition of “game” from all of the different ways that the term is used in normal language, and he was never really successful (which was his point, after all).

Roles

Perhaps the most significant topic of ongoing debate in business semantics today is the increasing importance of “roles.” When we talk about individuals and organizations by themselves, it is normally clear what we’re talking about. However, when we refer to a “customer” or “employee,” these terms are highly dependent on context. For example, if we were talking about an individual walking around a department store, whether they are a “customer,” “employee,” or “security guard” depends on whether they are buying products, selling products, or watching people. In business semantics, we begin by defining actors and segue into defining the roles that actors play by the time we are done developing business process (workflow) diagrams.

Relationships

General relationships are important semantic categories as well. The most important things to people are their family, business, and social relationships. Organizational structure is usually modeled as relationships. And while humans are good at reasoning based on relationships (e.g., if Y is the son of Z, and X is the daughter of Y, then ...), computers historically have not been so good at it.

Business Exchanges, Processes, and Relationships

The final group of categories consists of extensions of the ones that have already been covered. Business exchanges are modeled by classic bartering transactions in which both actors get something.⁵ In today’s world, a complete business exchange often includes a series of messages or transactions (e.g., order, shipment, bill, and payment).

Major business processes (also referred to as value chains) often relate to real business exchanges (e.g., order to collection, and requisition to acquisition).

Finally, business relationships refer to all of the business exchanges between two business partners over their entire history.⁶

Some Comments on Business Semantics

Every Journalism 101 student is taught to ask six basic questions: Who? What? Where? When? Why? How? These fundamental questions have to do with the recognition that the real world has only a small number of important categories that everything else revolves around. The study of business semantics has reinforced the importance of these major categories: actors, messages, objects, events, and location. In every business application, these semantic categories dominate the important physical data organization. Actors refer to customers, objects refer to products, and messages refer to orders, shipments, bills, and payments. MDM is, after all, about the most fundamental actors and objects.

Within enterprise data architecture, business semantics play an increasingly important role. While no one can possibly understand all the various tables and files that exist in a large organization, they can come to understand what the major actors, messages (transactions), and objects are, and those are the highest-level categories that the data architect has to deal with. The major entities then help us set boundaries.


⁵ In legal terms, a contract between two parties (legal persons) is not binding unless both parties get some consideration. That’s why, if you are using a contract even to give something valuable to someone else, they have to pay you a nominal amount (e.g., $1) as consideration.

⁶ While there has been a great deal of interest in customer relationship management (CRM) over the years, very few CRM systems actually attempt to bring together all of the interactions that the enterprise and the customer have had over the years. There is, however, much more interest in doing just this in many of the customer hub implementations that are underway.


Attributes, Views, and Databases

The previous section discussed the fundamental real-world semantic basis for data architecture. In this section, we discuss a small set of fundamental issues and confusions surrounding database management. One of the most fundamental confusions that hinders communication among users, database designers, and developers is the distinction between attributes, views, and databases. Over and over again, throughout the history of systems development, this confusion has kept organizations from developing sound databases and data architectures.

Attributes, Collections, and Identifiers

Attributes (data fields) are elements of data that represent some property of a business semantic entity (e.g., “name,” “address,” and “age”). A derived attribute is a computed or derived value created from some other piece of data in a given context. Records are sets of attributes belonging to a common subject.⁷ Identifiers are attributes that identify the subject of a record.

Views

A view is a structured (normally hierarchical) set of data attributes that a user sees (or wants to see) in some report, screen, or other representation.⁸ In an ideal world, users would be able to define just the views that they need in their systems without any concern for what data attributes or database structure already exist, or conversely, use existing metadata to select existing attributes and add any that don’t currently exist.

Databases

Essentially, a database is the logical sum of all the views for a given application domain. In an ideal world, the logical database would be automatically derived from all the views defined.
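As a hedged illustration of that ideal — the view names and attributes below are invented for the example — deriving the logical database can be as simple as taking the union of the attributes in every view, grouped by subject:

```python
# Sketch: a logical database as the "sum" of user views. Each view is a
# set of (entity, attribute) pairs; the derived schema is their union,
# grouped by entity. All names are illustrative.
from collections import defaultdict

views = {
    "order_entry_screen": {("customer", "customer_no"), ("customer", "name"),
                           ("order", "order_no"), ("order", "order_date")},
    "billing_report":     {("customer", "customer_no"), ("customer", "address"),
                           ("order", "order_no"), ("order", "amount")},
}

schema = defaultdict(set)
for attributes in views.values():
    for entity, attribute in attributes:
        schema[entity].add(attribute)

for entity, attributes in sorted(schema.items()):
    print(entity, "->", sorted(attributes))
# customer -> ['address', 'customer_no', 'name']
# order -> ['amount', 'order_date', 'order_no']
```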

Data (Information) Organization Issues

Working down from business semantics and fundamental concepts, a data architect needs to understand the basic approaches to organizing data. Among the most popular approaches are network, hierarchical, relational (tabular), object-oriented (OO), inverted file, multidimensional, XML, and spreadsheets.

Network Databases

Network-oriented databases were one of the earliest approaches to organizing large amounts of data. The first industry database standard, for example — proposed by the CODASYL committee — was a network database. Over time, network databases lost out, first to hierarchical databases and then to relational databases. But network databases have always excelled in flexibility, and the amazing growth and adaptability of the Internet is due in large part to the fact that it is built on something very much like a simple network database. The problem with network databases is the difficulty in handling updates, especially deleted nodes, which is the Internet’s biggest problem as well.

Hierarchical Databases

Hierarchical databases are another popular method of organizing data. If user views are hierarchical, the reasoning goes, why not store the data in the same sequence as the views? IBM’s IMS is perhaps the most famous hierarchical database. Historically, hierarchical databases have always provided high performance in terms of transactions per second. The problems with hierarchical organization have to do with flexibility and with the ability to support large numbers of multiple views at the same time. Both network and hierarchical databases are sometimes referred to as navigational databases because they require the user (i.e., the programmer) to navigate through the database to find information or create a view. Large numbers of network and hierarchical databases exist today in legacy applications, and hierarchical organization has been resurrected by the OO folks.

⁷ Here, records refer to logical, not physical, sets of data.

⁸ A view can be a series of sounds (e.g., voicegram) or tactile (e.g., braille), but logically, the data structure remains the same.

Relational Databases

Certain database researchers considered navigational databases merely a stepping stone to a more logical and mathematical approach for defining, updating, storing, and retrieving information. Edgar (Ted) Codd was the leader of a group at IBM that came up with what they considered a better way of organizing databases. In 1970, Codd published a landmark paper entitled “A Relational Model of Data for Large Shared Data Banks,” which set forth a number of basic principles. The most significant were:

(1) data independence from hardware and storage implementation and (2) automatic navigation, or a high-level, nonprocedural language for accessing data. Instead of processing one record at a time, a programmer could use the language to specify single operations that would be performed across the entire data set. [1]

Relational databases are made up of a number of tables, each of which has some variable number of rows (records) and is made up of the same number of columns (attributes). Associations between tables are created through operations on common attributes (identifiers, also called keys). The elegance of this construction allowed the development of mathematically provable operations on data, as well as a mathematical approach to database design. That design approach was called normalization, which provided rules for developing a database from a set of views. Others have taken the same idea to create approaches for the automatic design of relational databases.
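A small sketch of those principles in action, using Python’s built-in sqlite3 module (the tables and column names are invented for the example): two tables associated through a common key, queried with a single set-oriented statement instead of record-at-a-time navigation:

```python
# Sketch: relational organization in miniature. Two tables are associated
# through a common attribute (the key customer_no); one declarative query
# replaces record-at-a-time navigation. All names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cust_order (order_no INTEGER PRIMARY KEY,
                             customer_no INTEGER REFERENCES customer,
                             amount REAL);
""")
db.executemany("INSERT INTO customer VALUES (?, ?)",
               [(1, "Acme"), (2, "Globex")])
db.executemany("INSERT INTO cust_order VALUES (?, ?, ?)",
               [(10, 1, 250.0), (11, 1, 99.5), (12, 2, 400.0)])

# "Automatic navigation": the join is specified, not programmed.
for name, total in db.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customer c JOIN cust_order o ON c.customer_no = o.customer_no
        GROUP BY c.name"""):
    print(name, total)
```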

Relational databases began as research projects in the 1970s and became commercial tools in the 1980s. By the mid-1990s, relational databases had overcome a number of performance and integrity problems to become the dominant form of database organization. The vast majority of large databases developed over the last 15 years have been built on relational databases. This is still true today, though new database approaches are increasingly popular.

OO Databases

At about the same time that relational databases were beginning to replace network and hierarchical databases as the principal form of commercial database, OO programming was beginning to gain popularity. Instead of being built on a simple tabular data model, OO programming and design is built on a set of object-class hierarchies. Object-class definitions contain not only data but also “methods” (behavior/programs). The use of object data structures has led the way to the development of much more complex data types (structures) that, in turn, have made it possible to operate easily upon a wide variety of complex sets of information (documents, sound, video, and multimedia). Today, most content management systems employ some form of OO database.

While OO has become the preferred programming approach, most large commercial databases are still stored and maintained on one of the more popular relational database platforms (Oracle, IBM DB2, Microsoft SQL Server, Informix, etc.). This has created what is referred to as the “relational-OO impedance mismatch.”

Though more complex, OO databases share both the advantages and disadvantages of hierarchical databases. As my colleague Arun Majumdar has phrased it, “OO databases are good vertically but not so good horizontally.” Simply put, objects work well where the data is naturally hierarchical and not so well where the natural organization is some form of network or multiple simultaneous hierarchies. In something of a step backward, OO databases are very navigational in nature, which makes it difficult to build a common query language that will operate over large numbers of different objects as SQL can do over hundreds or thousands of tables in a very large database.


Inverted-File Databases

Inverted-file databases were created to deal with unstructured document-oriented data. Inverted-file databases operate by: (1) indexing every word in every sentence of every document; (2) excluding noise words (a, and, or, the, etc.); (3) sorting all the word references alphabetically; and then (4) querying data by looking up words and phrases. This last trick is accomplished by applying logical operations against those indexed words (or common phrases) to retrieve all the appropriate sentences and documents where those words or phrases are used. Inverted-file technology is still a base technology of all the popular search engines.⁹
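Those four steps lend themselves to a compact sketch (a toy index with an invented noise-word list, not production search technology):

```python
# Sketch of an inverted-file index: index every word, exclude noise words,
# keep postings available in sorted order, and answer queries with set logic.
from collections import defaultdict

documents = {
    1: "the customer ordered the product",
    2: "the vendor shipped the product",
    3: "the customer paid the invoice",
}
NOISE = {"a", "and", "or", "the"}           # step 2: noise words to exclude

index = defaultdict(set)                    # word -> set of document ids
for doc_id, text in documents.items():      # step 1: index every word
    for word in text.split():
        if word not in NOISE:
            index[word].add(doc_id)

# Step 3: the word references can be listed alphabetically.
print(sorted(index))
# Step 4: a query is a logical operation over postings sets.
hits = index["customer"] & index["product"]   # AND
print(sorted(hits))                           # -> [1]
```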

Multidimensional Databases

Historically, databases used for end-user (analytical) applications were largely hierarchical. In the 1980s, a number of multidimensional databases became popular to support the need to quickly analyze and display management information. These databases were built around the ability to store and quickly retrieve data organized in multiple hierarchical dimensions at the same time (e.g., geographic region, product family, time). Such systems, which usually were built around proprietary data structures, made it possible for end users to instantaneously drill down or collapse layers in these hierarchies. Conceptually, multidimensional databases can be thought of as very large matrices.

While multidimensional databases are popular with end users, they take a lot of time to build. Hardware and software advances have hidden much of this background work, but large numbers of dimensions are still a problem. As the number of hierarchical dimensions increases arithmetically, the size of the matrices increases geometrically (i.e., exponentially).

Another approach for creating multidimensional databases is Ralph Kimball’s star schema approach to data warehousing. In this case, a multidimensional structure is represented in a relational database with a central fact table and a series of connected dimensional tables. Here, relational technology is used to implement a multidimensional view.
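A hedged sketch of the star schema idea (the fact and dimension names are invented): a central fact table joined to its dimension tables with ordinary SQL, so a relational engine serves up a multidimensional slice:

```python
# Sketch: a Kimball-style star schema in a relational database.
# sales_fact is the central fact table; region and product are
# dimension tables. All names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE region  (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, family TEXT);
    CREATE TABLE sales_fact (region_id INTEGER REFERENCES region,
                             product_id INTEGER REFERENCES product,
                             month TEXT, amount REAL);
""")
db.executemany("INSERT INTO region VALUES (?, ?)", [(1, "East"), (2, "West")])
db.executemany("INSERT INTO product VALUES (?, ?)", [(1, "Widgets"), (2, "Gears")])
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
               [(1, 1, "2007-01", 100.0), (1, 2, "2007-01", 50.0),
                (2, 1, "2007-01", 75.0)])

# A multidimensional "slice": sales by region for one product family.
for row in db.execute("""
        SELECT r.region_name, SUM(f.amount)
        FROM sales_fact f
        JOIN region r  ON f.region_id = r.region_id
        JOIN product p ON f.product_id = p.product_id
        WHERE p.family = 'Widgets'
        GROUP BY r.region_name"""):
    print(row)
```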

XML

One of the major developments in data organization in the last decade has been the increasing importance of XML as a data exchange mechanism. XML, an outgrowth of the SGML markup language, emerged in the late 1990s as a way to produce data structures that are both human- and machine-readable. XML has become increasingly popular as a way to communicate complex data structures (documents, images, video, etc.) and as the basis for information exchange in a number of domains. Like OODBMSs, XML is spawning XML databases that mirror the structure of the XML that is processed.
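As a brief illustration (the element names are invented, not taken from any particular standard), the same order document is equally legible to a person and to a machine:

```python
# Sketch: an XML business message that is human-readable on inspection
# and machine-readable via a standard parser. Element names are illustrative.
import xml.etree.ElementTree as ET

payload = """
<order order_no="10">
  <customer customer_no="1">Acme</customer>
  <line product="widget" quantity="4" price="25.00"/>
</order>
"""

order = ET.fromstring(payload)
customer = order.find("customer")
line = order.find("line")
print(order.get("order_no"), customer.text,
      line.get("product"), line.get("quantity"))
```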

Spreadsheets

Any discussion of database organization strategies would be incomplete without a discussion of spreadsheets. While database administrators and most data architects disdain to discuss spreadsheets, they are by far the most widely used form of data organization among non-IT professionals. In finance and any number of other domains, mission-critical spreadsheets contain much of most enterprises’ critical data. Spreadsheets make it possible for people with limited understanding of programming or data processing to compute complex functions and to manipulate and analyze complex “what if” scenarios. In recent years, Microsoft has made Excel the principal user interface to tabular data. Because of the ubiquity of Excel, other database and BI vendors have made Excel their preferred user interface.


⁹ The most important addition to basic inverted-file technology for search engines has been the “page-rank strategy” developed by the founders of Google — Sergey Brin and Lawrence Page. This algorithm was an outgrowth of work by information scientists working in the field of research publication. These researchers discovered that one of the best ways to rank the importance of a specific scholarly publication was to determine how many times that publication was cited in other scholarly papers. The page-rank algorithm essentially refined the same approach to rank Web sites.


Persistence and Transience (Where Do Object-Class and XML Models Fit?)

One interesting development of the last decade has been the influence of OO thinking with respect to databases. One of the things that differentiated OO from previous programming paradigms was that OO focused on creating sophisticated objects and object-class hierarchies to simplify programming and that, historically, those objects have lived in transient memory rather than on some sort of persistent storage medium like a hard disk.¹⁰ By and large, data architecture deals with persistent data, not transient data, but with the importance of object-class models, some provision has to be made to incorporate these models into the overall data architecture.

Some Comments on Database Organization

Besides network, hierarchical, relational, OO, inverted-file, and multidimensional organization structures, there are a number of other approaches as well. Each method of data organization has advantages and disadvantages. Indeed, all of the major commercial databases utilize, under the covers at any rate, most of the major organizational strategies.¹¹ And, for historical reasons, a great many of these organizational schemes are found in most large organizations. So data architects need to be familiar with all (or nearly all) of the major approaches. To assume that one particular form of database organization is, or will ever be, the de facto corporate standard is just wishful thinking.

Data architecture, like the architecture of any large city, is a complex, ongoing, historical happening. Old patterns continue to exist alongside new ones used for new classes of applications. And old (classical) organization approaches are reborn with new names and slick technological implementations. The trick today is to be able to combine these various organizational strategies without introducing more complexity than already exists.

Distributed Database

A distributed database is a database that is under the control of a central DBMS in which the storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location or may be dispersed over a network of interconnected computers.

Collections of data (e.g., in a database) can be distributed across multiple physical locations. A distributed database is divided into separate partitions/fragments, and each partition/fragment may be replicated.

Besides replication and fragmentation, there are many other distributed database design technologies — for example, local autonomy and synchronous and asynchronous distribution. Which technologies are implemented depends on the needs of the business, on the sensitivity/confidentiality of the data to be stored, and hence on the price the business is willing to pay to ensure data security, consistency, and integrity.

One of the most important concepts a real data architect or data modeler needs to understand is “distributed database.” In textbooks, distributed database has to do with databases with planned redundancy of data distributed over multiple servers, normally in different physical locations. In practice, distributed database applies to all copies of the same (or similar) data stored anywhere in the enterprise.

Historically, database designers considered any form of redundancy a bad idea because the data is often not updated correctly. But redundancy is often necessary, even desirable, especially where availability or performance are significant problems. As it turns out, redundant data is not bad from either a database or data architecture standpoint. It is inconsistency that is the real enemy!

¹⁰ OO databases have been around for at least three decades, but only recently have they begun to gain much traction. This means, by and large, that OO developers have to map their in-memory objects onto relational tables for storage and retrieval.

¹¹ For the sake of brevity, list structures were omitted from the set of major approaches to data organization. Lists are, in fact, a very popular way of organizing data in a variety of application domains.

If one could guarantee that the network and all of the servers that contained portions of very large databases would always and instantly be available, then distributed database design wouldn’t be a problem. However, in the real world, at any one time, various network links may not be available, and one or more of the servers may be (or need to be) offline.

As stated, the principal problem with distributed database is updating. If you have redundant data, how do you keep it up to date? If the data doesn’t have to be always in sync, then batch processing will work. In data warehousing, even today, batch updating of the data warehouse from existing databases is the most common approach.

In the 1970s and 1980s, database researchers moved to solve the problem of distributed database updating. The basic problem, simply stated, is the following: if the same data (tables, rows, columns) exists in two databases, how do you apply the updating so that at the end of the update cycle, all of the redundant data in all of the databases is identical?

Two major solutions were created: two-phase commit and replication. In two-phase commit, the server that initiates the update applies it to its local data and then sends independent update commands to each of the other servers on which the distributed data resides. Each of these remote servers, in turn, updates its local data and sends a “successful update” message back to the originating server. If all of the distributed sites communicate back success within a specific period of time, then the originating site sends out a “commit message,” and all the servers commit their data. If one or more of the distributed sites fails to send its “successful update” message back in the allotted time, then the originating site sends out a “rollback message” to each of the distributed sites.

Two-phase commit ensures that all of the persistent data in a system, however large, is always consistent; because of this, it is the gold standard for distributed updating. Unfortunately, two-phase commit assumes that the devices containing data are nearly always connected to a highly reliable network. As a result, two-phase commit is not widely used for distributed applications where the network is unreliable or where the various servers (or laptops) may be offline for significant periods.
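The protocol as described can be sketched in a few lines (a toy, in-memory coordinator; real implementations add logging, timeouts, and crash recovery):

```python
# Sketch of a two-phase commit coordinator, per the description above.
# Phase 1: ask every server holding a copy to apply the update tentatively
# and report success. Phase 2: commit everywhere only if all succeeded;
# otherwise roll back everywhere. Servers here are simulated in memory.

class Server:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.committed, self.pending = {}, {}

    def prepare(self, key, value):            # tentative local update
        if not self.healthy:
            return False                      # no "successful update" message
        self.pending[key] = value
        return True

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()

def two_phase_commit(servers, key, value):
    if all(s.prepare(key, value) for s in servers):   # phase 1
        for s in servers: s.commit()                  # phase 2: commit
        return True
    for s in servers: s.rollback()                    # phase 2: roll back
    return False

servers = [Server("central"), Server("remote-1"), Server("remote-2", healthy=False)]
print(two_phase_commit(servers, "customer.1.name", "Acme"))  # False: rolled back
```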

Today, an increasing number of organizations are providing their workforce with portable computers (laptops, PDAs, cell phones), which, even with wireless capability, are often offline for significant periods. Such systems rely not on two-phase commit but on a form of data replication. Usually this approach works best where there is a central database server and many remote devices with a relatively small amount of shared data on each portable device. This approach allows users to work offline, update the data on their portable machine, and then synchronize with the central site when they next connect.¹²
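A minimal sketch of that synchronize-on-reconnect pattern follows (a highest-version-wins rule over an invented key/value layout — a deliberate simplification of what commercial replication products actually do):

```python
# Sketch: offline replication with synchronize-on-reconnect. Each record
# carries a version counter; on sync, the higher version wins in both
# directions. Real products must also detect and resolve true conflicts.

def sync(central, device):
    for key in set(central) | set(device):
        c_ver, c_val = central.get(key, (0, None))
        d_ver, d_val = device.get(key, (0, None))
        winner = (c_ver, c_val) if c_ver >= d_ver else (d_ver, d_val)
        central[key] = device[key] = winner

central = {"customer.1": (3, "Acme Corp.")}
device  = {"customer.1": (2, "Acme"),          # stale copy on the laptop
           "customer.2": (1, "Globex")}        # record added while offline

sync(central, device)
print(central)   # both copies now identical
print(device)
```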

Straight Through Processing

If one looks at data architecture from an enterprise standpoint, the most common way to replicate data is batch updating (i.e., copying the data from one system and then using it to update common and redundant data on a second system). Historically, batch processing was the most common form of serial intersystem communication. In most large organizations, data exchange to move information from one system to another was accomplished either by batch processing or by reentering data.

In certain application domains (e.g., financial trading and payment areas), this has led to unacceptable delays. Because of the press of e-business, for example, there has been a strong push to process trades and payments within a single business day. This process has come to be known in the financial world as straight through processing (STP). STP is also talked about with increasing frequency in data warehousing, where it means near-real-time updating of data warehouse data, which historically has been on longer weekly or monthly cycles.

¹² Microsoft’s version of this is called “Smart Client” technology, which includes talk about “optimistic” and “pessimistic” locking, but each one has some serious drawbacks in systems with very high transaction rates.

STP represents one of the most concerted steps in any industry to move to near-real-time updating of massively distributed data. Most organizations today still have huge numbers of systems connected largely via copies of transactional batch data that is used in amazingly complex threads throughout their organization. Over time, the expectation is that the drive for faster and faster response time will, in turn, drive more and more application areas (e.g., supply chain) to some form of STP.

Search and Presentation Technology

It would be a major mistake to ignore the impact that the Internet has had on data architecture in major organizations. Before the advent of the Internet, most large organizations had private networks and/or operated over networks shared with other firms in their own industry. The Internet changed all that. Today, nearly every major enterprise in the industrialized world is connected to the Internet. Some — for privacy, confidentiality, and security reasons — restrict or prohibit direct connection, but this is more the exception than the rule.

The Internet is now the largest, most used database in the world, and Google is the most used search engine. Almost single-handedly, Google has radically transformed the way people think about finding information. In developing a report like this, I suspect that I used Google at least 20 or 30 times, maybe more. I used it to find definitions, papers, links to other research, and so on. I also use Google Desktop to cross-reference my internal files and, most importantly, to find specific e-mail and attachments. I can’t imagine what I would do if Google suddenly wasn’t available. What I do know is that it would cost both me and those who pay me to do research a great deal more than it currently does.

Google, as well as other search engines, is now reaching into corporate information retrieval. To date, this is mostly used to look up documents and files. But no doubt, advanced search engines in the near future will be able to look up and format almost any kind of information. Already there are Google Scholar, Google Patents, Google Finance, and Google Analytics. Organizations that have made their living providing information to selected markets will increasingly find that they must compete or cooperate with the likes of Google, Yahoo!, or Microsoft.

Search and presentation technologies are revolutionizing nearly every field. First, we had map, weather, and restaurant information on different sites, and now we have intelligent mapping services where all the information exists on a map or globe. Already, with wireless GPS devices, computers can show us the information we’re looking for using our current location as a reference point. High-speed Internet connections coupled with new technologies for visualizing information are having profound implications for corporate (enterprise) systems. Increasingly, the expectations of the public, business partners, and corporate users are growing, fueled by what they see every day for free on the Internet.

CURRENT PROBLEMS WITH DATA/INFORMATION IN LARGE ORGANIZATIONS

What, then, are the principal data problems that organizations face in the early 21st century, at the end of this third cycle?

Data quality. One of the dirty little secrets in large enterprises has to do with problems reconciling data. At month’s end, for example, the general ledger system may show one set of numbers for gross sales and net profit, while the sales order system often shows significantly different numbers. Getting the official numbers to balance requires hundreds of person-hours each quarter, and, even then, many of the changes required to make everything work out are suspect. Data quality is an ongoing problem that often undermines efforts like data warehousing initiatives. In addition, experience teaches that data quality, even within a single data silo, can be problematic. With the emergence of SOX and other rules and regulations around the world, the ongoing problems of data quality loom large.

Integration. The next most serious ongoing data problem has to do with bringing together common data from one application with that from another.¹³ Both operations and marketing organizations often keep massive customer files, but there always seem to be problems getting the data to match up. The same is true with product information. There are product files in the sales and marketing systems, product files in manufacturing systems, and yet other product files in engineering systems. But there is no one place where someone can look at all the product data. In multidivisional organizations, these integration problems are magnified.

Getting at critical data. Most organizations have a polyglot of different legacy and COTS systems that have to be interfaced to support enterprise data processes. Many of these systems, while efficient at capturing and updating their internal databases, are not nearly as good at allowing easy access for new querying and reporting purposes. Often, legacy databases lack the adequate, up-to-date metadata that makes easy access possible.

Intra- and inter-enterprise sharing of data. Modern businesses depend increasingly on electronic supply chains and outsourcing. This means that data must be shared not just across departmental and divisional boundaries but also across enterprises. The post-9/11 analysis that intelligence and law enforcement agencies had trouble sharing data was not a surprise to anyone who understood the technical, organizational, and legal difficulties in setting up robust data sharing mechanisms.

Recovery of historical (archived) data. Never has the rate of technological change been greater. New generations of technology occur every three or four years today rather than every 10 or 15 years as it was just a couple of decades ago. When organizations are required to dig back 20 or 30 years, for example, to satisfy some class action lawsuit, they often find that they no longer have any equipment that can even read the media on which the historical data is stored. Moreover, the older the data, the less likely that adequate metadata exists that describes the data files or databases.

Conceptual shifts. Not only are there problems that result from new generations of hardware and software, there are also shifts in systems and database paradigms. In database technology, the first successful databases were either hierarchical or network-based. During the 1980s and 1990s, relational databases came to dominate database strategy, and in the last decade, OO programming has created an impedance mismatch between programming and database management. Even recent developments like XML make it possible to send, receive, and store much more complex data/information artifacts. However, they do not match up well with persistent relational database implementations.

Even with all these problems, things are not hopeless. As planning gurus are apt to point out, every problem is an opportunity. The fact that there are so many problems with data today is in large part the result of the success of computers and communication systems in transforming our organizations and changing the way business is conducted around the world. The world is increasingly data-centric.

13 It is clear that data quality, integration, and common metadata are interrelated. If you have multiple copies of the same thing, you will have problems integrating the data. The organizations that use the separate copies will develop "dialects" where "customer," "product," etc., will mean different things.

Today, millions upon millions of people and enterprises are capable of instantaneous worldwide communication. The Internet, which has created and/or revolutionized entire industries, is only a little over a dozen years old. Data needs have mushroomed because organizations have moved increasingly from transporting atoms (physical things) to transporting bits (electronic pulses). However, if organizations are going to be able to leverage these new advances, they will have to manage their data to be able to meet current and future needs.

DATA/INFORMATION FORCES

To go along with the problems and opportunities described previously, there is also a series of forces driving and enabling changes in our data architecture. For the purposes of this report, we classify business needs as drivers and technology opportunities as enablers. Some of these, as you will see, are attempts to solve the problems discussed earlier.

Business Drivers (Market Pull)

Top managers and professionals continually ask for a multitude of things, including:

An integrated view of major data classes. Some of the questions that one would assume are the easiest to answer — such as, "How much did we sell last month by product and customer?"; "How much business did our top 10 customers do with us last year?"; and "How profitable are our top 10 customers?" — turn out to be exceedingly difficult to answer in most big organizations. Getting integrated data is like pulling teeth. And as our data architecture becomes more baroque, integrating similar data becomes ever more difficult. (A brief sketch following this list shows how simple such questions become once the data is integrated.)

The need to get at key data faster. The world in Thomas L. Friedman's The World Is Flat is moving faster and faster. Many major decisions have to be made today or next week, not next month or next year. Oil prices spike, oil prices plummet. There is a strike in Italy. What do you do? The center of the world's economic attention is no longer North America and Europe, but China, India, Latin America, or somewhere else in the developing world. In order to be competitive in this world, information has to flow faster from operations to corporate headquarters, and it must incorporate external data sources.

The need to handle large amounts of structured and complex unstructured data. Today, large organizations are requiring their vendors to be closely tied to their supply chain systems. Large retail customers want to be able to supply their vendors with detail data on every transaction that includes one of their products in every store in the chain. And with radio frequency identification (RFID), the amount of detail data will continue to double or triple every couple of years. It won't just be transaction data that is captured; in some locations, video camera and recording system vendors are now supplying retail stores with software that tracks customers' movement within stores. The original purpose of these systems was to reduce shoplifting and pilfering, but smart video tracking software is now also providing vast amounts of information for data mining of customer behavior.

Immediate updating of all data. In order to make up-to-the-second decisions, transactions must be able to update not only operational databases but management/analytical ones as well (STP).
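To make the first of these drivers concrete, here is a minimal sketch of why integration is the hard part. Everything in it is hypothetical (the division extracts, column names, and amounts are invented for illustration); the point is simply that once order data from separate systems shares conformed customer and product identifiers, the "sales by product and customer" question becomes a one-line aggregation.

```python
# Hypothetical extracts from two divisional order systems. In practice the
# hard work is getting both systems to agree on customer_id and product_id.
import pandas as pd

div_a = pd.DataFrame({
    "customer_id": ["C001", "C002", "C001"],
    "product_id":  ["P10", "P10", "P20"],
    "amount":      [1200.0, 450.0, 300.0],
})
div_b = pd.DataFrame({
    "customer_id": ["C002", "C003"],
    "product_id":  ["P20", "P10"],
    "amount":      [800.0, 95.0],
})

# With conformed identifiers, last month's sales by product and customer
# collapses into a single group-by over the combined data.
sales = pd.concat([div_a, div_b], ignore_index=True)
by_product_customer = (
    sales.groupby(["product_id", "customer_id"])["amount"]
         .sum()
         .reset_index()
)
print(by_product_customer)
```

Without those shared identifiers, the same question requires the cross-system matching and reconciliation work described earlier.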

Technology Enablers

As fast as the business world changes, the technology world changes even faster. Each day, new discoveries, products, and processes are announced. Each of these products enables new kinds of business solutions. No organization can take advantage of all of these innovations; the trick is to figure out which technologies to buy into and when to make the move. Move too quickly, and the technology will not be ready; wait too long, and the window of opportunity may have closed. In the world of data, key technologies include:

The cost per byte of data storage. Probably no area in technology has tracked or exceeded the predictions of Moore's Law better than data storage. Just recently, for example, Hitachi announced a new 1-terabyte disk. It was not so long ago that 1 terabyte was considered a very large database. Indeed, even today, many organizations could put all of their structured data for multiple years on such a drive. But even with all this capacity, organizations are pressed to capture, index, and retrieve all the important information at their disposal.

The speed of data retrieval. Not only is data storage becoming cheaper every few months, but its speed is also increasing. The dominant form of storage, rotating magnetic memory, stands to be replaced for a great many applications by large flash memories, which have very low power requirements and are nonvolatile, thus eliminating many problems of backup and recovery.

Search technology. A business consultant, talking recently about Google's great success, described using Google as akin to "sitting next to the smartest kid in school." Google has made accessing billions of documents, images, and Web sites an amazing exercise, one that increases the IQ of the Internet user by quite a lot. As a result, Google is moving to provide the same subsecond, highly relevant access to internal enterprise data.

Grid computing. Google works its magic not by using large, powerful server farms, but by using hundreds of thousands (millions) of very simple computers in centers around the world, running special grid-based indexing, organizing, and search algorithms.

Data mining. Data mining is a technology that provides software algorithms for analyzing millions of records and discovering often-unexpected patterns. This technology is being utilized in industries from retail to telecommunications to drug discovery.

Data visualization. Data visualization provides new ways to utilize human intelligence to literally see unexpected patterns in 3D images. As the amount of information that people have to work with increases, so does their need for ways to consolidate and visualize their data in new, intuitive ways.

Semantic modeling. Long-term, the most important new technology affecting data is the creation of truly semantic data models, which include ontologies, propositions, and inference rules.

There is one caveat: all of this technology works best if the data (and the metadata) is good and up to date, and someone understands what it means. Today, serious problems occur because no human ever looks at the detail data, due to sheer volume. And despite decades of discussion about artificial intelligence, most computer algorithms only find the things that they are looking for. Technological breakthroughs that will actually allow computers to understand and reason more like human beings are getting closer but still need a lot of human intervention.

The Netherworld of Legacy Data

Understanding the business drivers and technology enablers that propel our organizations into the future is not even half of the challenge large organizations face in developing (and maintaining) an effective data architecture. What holds them back is a very large anchor in the form of the existing legacy data environment, where legacy data is defined as any data in any system that is in operation. Table 1 gives just a small window into the nature of the problem by displaying the basic database statistics for a middle-sized government enterprise with an annual budget of about US $1.2 billion and about 3,000 employees. The organization has been involved in computing since the late 1950s and uses computers in every aspect of its organization, from advanced engineering and planning through sophisticated project management. An early user of DBMSs, the agency has large investments in both mainframe (DB2) and server (Oracle) databases. These numbers, by the way, do not include instances of Microsoft SQL Server, Microsoft Access, and Excel spreadsheets, some of which also contain critical agency information.

Now, this is not an extreme case. The agency in question has a well-run IT organization with an experienced DBA group. Indeed, this organization has been involved in serious systems and database planning and architecture for nearly two decades. More importantly, it is not a large organization as enterprises go. What stands out is just how large the numbers of tables and attributes are.

The technical database group and senior developers in this organization know how integrated database systems should be designed, developed, and deployed. But over time, the push to deliver new functions in shorter and shorter periods has encouraged project teams to develop their own databases with similar (or the same) tables and lots of attributes in common.

From my experience over the last two decades, this organization is not unique. Indeed, my firsthand knowledge would suggest that many attributes, even many tables, that exist in a normal data center are not used, or not used in an appropriate way. One highly complex COTS application that I reviewed had something like 200 tables in its database. However, on examination, it turned out that almost half of the tables (94) had no rows (i.e., no data) in them! In addition, of the 20 or so major tables in the application, 50%-66% of the columns had either nulls or constants in them, meaning that they weren't being used either!

In this report, we focus on what the industry is now calling "master data" (customer, product, vendor, employee). Here, despite decades of concentration, the problem is even more difficult. Customer data is probably the worst case. Every commercial enterprise (and many public ones) focuses extensively on its customer data, but in organization after organization, it turns out there are multiple customer databases. This in and of itself would not be too bad, but in a number of cases, these databases don't have either a common scheme for identifying their customers or an agreement from system to system on what actually constitutes a customer.14

One of the reasons that this happens is that the needs of operational divisions and the enterprise are quite different. From the standpoint of an individual division or department, its underlying databases do not have to be consistent or compatible across the enterprise; they just have to work for that department or division. As long as all the bills get out and someone pays them, the billing department for Division A doesn't care if there are many duplicated customer numbers, for example. But at the enterprise marketing level, duplicate customer numbers matter quite a lot — it is important to know who your best customers really are and how much they buy. And as enterprises gear up for operations in the 21st century, getting a handle on data quality across the enterprise becomes a major need.


                        DB2     Oracle     Total
Databases                14         42        56
Tables                  560      3,700     4,260
Columns (Attributes)  80,000     17,600   97,600

Table 1 — Databases, Tables, and Attributes

14 One large manufacturing company that I worked with in the 1990s had two major (and lots of minor) customer databases. The one maintained by operations had some 500,000 customers on it, while the marketing organization had a customer database with more than 4 million customers!


All the business strategies and all the technology in the world will not solve the problems of legacy systems and legacy databases. Replacing some of the major applications with COTS packages will not solve the problem and in many cases may make it worse.15 What is required to actually make a dent in this data problem is a very long-term migration strategy guided by an informed, evolving data architecture, together with a program to separate operational data systems from analytical/managerial ones!

15 COTS packages have their own unique databases, attribute names, and business rules. Most of these packages, for some very good reasons, do not want anyone messing with their database structure or data names. In a great many cases, highly integrated COTS applications have proven difficult to extract data from and equally difficult to interface with the hundreds of existing legacy and other COTS applications found in the organization.

DATA ARCHITECTURE SOLUTIONS

I have written elsewhere that enterprise architecture is more like urban planning than it is like building architecture. While constructing buildings is a complex task and an interesting one, urban planning deals with much more complex interrelated factors and with many more dimensions. This way of looking at EA is beginning to get a wider following, with major organizations like Microsoft and Walt Disney promoting urban planning models [2].

If viewed as urban planning, EA can be thought of as depending upon a detailed understanding (documentation) of the current "as built" environment, as well as the collaborative development of master plans, rules, and regulations to help fashion (regulate) the future. Like cities and suburbs in urban areas, application and data architectures grow organically over time, and despite the best-laid plans or well-funded infrastructure, unplanned things happen.

In the same way that all cities have an as-is urban architecture, so too, it can be argued, do all organizations have an as-is data architecture, even if they don't recognize it or consciously manage it. The sum of an organization's legacy data provides its as-is data architecture. Unfortunately, unplanned data architectures often lead to many of the problems that we discussed earlier.


… a data architecture describes: (i) how data is persistently stored, (ii) how components and processes reference and manipulate this data, (iii) how external/legacy systems access the data (interfaces to data managed by external/legacy systems), (iv) the implementation of common data operations. [5]

To recapitulate: the as-is data architecture for an enterprise is the sum of all the data practices of the enterprise up to the current time. It represents both planned and unplanned activities; it represents internal and external data; and it represents large databases and small ones, enterprise-wide projects as well as local, departmental ones. It even represents the unauthorized Access databases and spreadsheets that are, in an enterprise data architecture, somewhat like the shanties and makeshift structures that dot urban landscapes. A serious enterprise data architect has to assume that everything that exists has value to someone in the organization. All of my experience shows that an organization's true as-is data architecture is enormously more complex than anyone suspects, even those involved in database management.

And the as-is data architecture is not restricted to just its so-called persistent data either; it must also include all the data exchanges that move from system to system. These data exchanges range from transferring physical media like tapes and disks to nearly instantaneous electronic communication. While it was assumed in the 1970s and 1980s that all systems would be essentially online within a few years, batch approaches have persisted, even for online systems processing millions of transactions per day, as a means of updating data in other systems. This serial updating process has turned out to be easy to understand, explain, and manage. What is not so easy to explain and manage is the timing of updating and exactly which data is updated when. Most existing systems have many versions of "the truth," depending on when you ask, what your definitions are, and, of course, who you ask. Quasi-independent databases tend to have different values for common information, so one of the driving factors in all data architecture efforts has to be improving the overall quality and reliability of the enterprise's core data.

Data Quality

The quality of data is a key (even critical) issue in every large organization. At base, if an organization's information isn't any good, the company will make serious, often fatal, mistakes. Ultimately, data quality is a pretty simple concept: it is a measure of how closely the data in our databases agrees with the real world. A data quality value of "1.0," for example, would mean that our data is in exact agreement with the corresponding values in the real world. All of our addresses for all of our customers would be correct, as would all of their telephone numbers. On the other hand, a data quality value of "0.0" would mean that there is no agreement at all between the data in our databases and the real world.
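That definition can be operationalized in a toy way. The sketch below is my own illustration, not a formula from this report: it scores a stored record against an independently verified version of the same record as the fraction of fields that agree.

```python
# A toy data quality score: the share of stored field values that agree
# with verified real-world values for the same entity. 1.0 means exact
# agreement; 0.0 means no agreement at all.
def data_quality_score(stored: dict, verified: dict) -> float:
    fields = stored.keys() & verified.keys()
    if not fields:
        return 0.0
    matches = sum(1 for f in fields if stored[f] == verified[f])
    return matches / len(fields)

db_record   = {"name": "J. Smith", "phone": "555-0100", "city": "Topeka"}
real_record = {"name": "J. Smith", "phone": "555-0199", "city": "Topeka"}
print(data_quality_score(db_record, real_record))  # 0.666... (2 of 3 fields agree)
```

A real measurement program would weight fields by business importance and sample records for verification, but the idea is the same.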

Obviously, time is an issue. If it takes a month to get a name and address updated, then it is likely that the names and addresses will be at least a month out of date. If there is not a systematic way of getting data updated once it is entered, then databases will become less and less accurate.16 And there are a large number of other data quality issues that companies struggle with. Often, an enterprise's customers will have more than one customer number in one system. Indeed, business promotions often encourage cheating. For instance, I've been told that if I apply for a store credit card, I will get an extra 10% or 15% off. When I tell the salesperson that I already have one of their cards, they often say, "Well, that's all right, they'll [the central IT folks] never know. Just use your initials instead of your full name." So not only do customers end up with multiple customer numbers, but the best customers end up with the most.

16 Demographers estimate that approximately 16% of the population of the US changes their address every year.
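The "initials instead of your full name" trick suggests the kind of matching logic needed to find such duplicates. The following sketch is hypothetical (the field names and records are invented): it reduces each customer record to a crude match key and groups records that share one.

```python
# Group customer records by a normalized (first initial, surname, address)
# key so that "Kenneth Orr" and "K. Orr" at the same address fall together.
from collections import defaultdict

def match_key(name: str, address: str) -> tuple:
    parts = name.lower().replace(".", "").split()
    return (parts[0][0], parts[-1], address.lower().strip())

customers = [
    ("10021", "Kenneth Orr", "12 Elm St"),
    ("48377", "K. Orr",      "12 Elm St"),   # same person, store-card duplicate
    ("55102", "Karen Ortiz", "9 Oak Ave"),
]

groups = defaultdict(list)
for cust_id, name, addr in customers:
    groups[match_key(name, addr)].append(cust_id)

duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {('k', 'orr', '12 elm st'): ['10021', '48377']}
```

Production matching engines use far more robust techniques (phonetic codes, address standardization, probabilistic scoring), but they are elaborations of this basic move.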

Ultimately, the most important indicator of data quality is use. The more an organization uses its data, the better the data will get. Conversely, data that is rarely or never used will not (cannot) be very good. Data warehousing has dramatically improved data in many organizations, though it often takes a while.

Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

— Bill Inmon, "Father of Data Warehousing"

Now, the complexity created by the drivers, enablers, and legacy data world described earlier is not new to IT. Indeed, these problems have been around for decades. And the larger and more progressive the organization, the earlier this complexity is encountered. By the middle of the 1980s, a number of large, data-intensive enterprises around the world had already reached a point where their data pain threshold had been exceeded. A new generation of relational DBMSs was coming on-stream with the promise of a much more elegant way of creating, updating, and retrieving data.

But these organizations already had hundreds (and in some cases, thousands) of applications with existing data files or first-generation DBMSs. A number of organizations had extensive experience with the explosion of databases occasioned by the lack of a clear strategy for integrating data and providing a single view of the truth. Analytical tools were coming on-stream that could provide managers and professionals with much improved reporting and visualization capabilities — if only they could get at the needed data.

Many of these organizations began to look at the wholesale replacement of their existing legacy databases with true centralized, integrated databases. However, when the IT planners put pencil to paper, they found that it would be 10, maybe 20, years before this integration could be effected (remember this point).

What these planners came up with was a different strategy, not just a technology strategy — a data architecture strategy known today as data warehousing. The question that touched off this revolution was, "What if we develop an integrated analytical database where we could bring all our major operational data together and then use that database for our reporting, analysis, and simulation work?" These planners felt that although building data warehouses might take three to five years, it would provide much of the value of total data integration and data consistency without disrupting the operational systems that had become essential to the day-to-day operations of their businesses.

This splitting of the reporting and analysis applications away from the operational domain created a new high-level enterprise data flow architecture that acknowledged the difference between operational systems and what many now call "informational systems" (see Figure 2). The term "informational" was coined to distinguish the operational systems handling day-to-day business activities from those systems associated with planning, forecasting, and control. While EA was emerging at about the same time, it was in data architecture that one of the first major enterprise architectural solutions was put into operation.17

17 The other major enterprise initiative that occurred around this time, "business process engineering," was also beginning to have a major effect on business organizations, coupled with technology advances in workflow management.

The next stage was to further divide the IS domain into data warehouses (or operational data stores) and data marts to allow end users to directly access highly structured informational databases (see Figure 3).

As it turned out, data warehousing was much harder than anyone anticipated. Integrating data from disparate sources became much more difficult, expensive, and time-consuming than its early advocates suspected. One of the problems had to do with the best strategy for designing a data warehousing database; the other had to do with the quality of the data and metadata of the myriad operational databases.

[Figure 2 — Early data warehouse architecture (circa 1990). Operational applications update the operational databases; a data warehouse population application loads the data warehouse, which in turn feeds business intelligence (BI) applications.]

[Figure 3 — Mature data warehouse architecture (circa 1995). ETL applications move data from the operational databases into an operational data store (ODS); data mart population applications then build the data marts that BI applications access directly.]

Two major data warehousing design strategies emerged during this period: a detail, near-normalized data warehouse–based solution (the ODS [operational data store] approach18) and a dimensional solution (the star schema approach19). As a practical matter, the ODS approach tended to lead to larger, more centralized data warehouses, whereas the star schema approach tended to lead to more departmental, divisional, project-oriented data marts. Most seasoned data warehousing professionals now see both approaches as being part of a mature informational data architecture solution.

18 One of the leading proponents of the detail, normalized data warehouse solution was Bill Inmon. One of his major contributions to data warehouse thinking was the idea of the "operational data store," a database that serves as a staging area for the cleansing, integration, and loading of what I call the "core data warehouse."

19 As stated earlier, the leading proponent of the star schema dimensional approach was Ralph Kimball. Taking a cue from multidimensional databases that were just coming online, Kimball came up with a strategy for mapping operational data into two major groups of tables: "fact tables," which contain either detail or summarized business transaction data, and "dimension tables," which contain information on the major dimensions of business analysis (e.g., "customer," "product," "location," "time," "geography").

Most large organizations today have one or more data warehouses, as well as hundreds of data marts, installed to support their key businesses. Over the years, these data warehouses have become more and more valuable from an analytical and planning standpoint. But data warehouses have also increased the pressure for even more sophisticated integration of enterprise data, especially for those core data categories that are central to future business processes.

Improving an organization's data architecture is fraught with many of the same problems that plague urban planners. Reshaping an urban area is often difficult because of competing political forces. Developers, large employers, and local activists all have distinct interests that are opposed to one another. Data architecture activities show similar forces at work — major users and COTS vendors play the role in enterprise and data architecture that major employers and developers play in the urban landscape. In a recent Executive Report, Luke Hohmann and Ken Collier talk about the growing influence that COTS packages such as ERP, CRM, and SCM are having on data architecture design in lots of companies [3]. Since organizations typically make huge investments bringing in one of these packages, and since these COTS vendors have much more money to invest in design and architecture, enterprise data architecture is increasingly being set by these vendors.

Master Data Management

There are a huge number of things an organization could include in a data warehouse or a data mart. However, time and again, the same categories of data rise to the top of the agenda of things that managers and professionals want to see included in an enterprise data warehouse, especially customer and product data.

Obviously, the subjects that most interest management have to do with those things that most affect the outcome of the business, namely the actors and objects. The actors and objects of most interest are usually things like "customers," "prospects," "vendors," "business partners," "products," "employees," "purchased parts," and "jobs." In fields like construction or software development, the object might be "project." But in no case would the primary subjects that management is interested in be much of a surprise to anyone familiar with the particular business or market.

As data warehouses have matured, data warehouse tool vendors and consultants have noticed one constant theme: enterprises wanted more and more data about the customers and products that were most critical to their businesses. They also wanted more sophisticated ways of expressing the data. Moreover, enterprises wanted to integrate not only more of their internal data into these more sophisticated relationships but more external data as well.

From these observations has come a push toward MDM. (In case of confusion, you can substitute the term "subject" for "master" and think in terms of "subject data management" [SDM]. It comes down to pretty much the same thing.) Master reference data is the common set of definitions, identifiers, and attributes for the most significant semantic data categories across the enterprise.20

20 Notice that there are no significant differences between the definition of master reference data and the definition of any data within an organization, except for the explicit stipulation that the definitions be enterprise-wide.

Master data management is the use of enterprise master reference data to construct centrally managed data hubs for these categories.

MDM involves developing increasingly sophisticated sets of data centered around one or more of the major semantic components of an organization, typically actors, objects, and locations. A set of popular master data categories is shown in Table 2.

Actors: Customers, Vendors, Preferred suppliers, Trading partners, Employees, Shareholders
Objects: Products, Purchased parts, Services, Accounts, Assets, Policies
Locations: Locations, Offices, Regions, Geographies
Relationships: Organizational hierarchies, Sales territories

Table 2 — Popular Master Data Categories
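To make the definition concrete, here is a minimal sketch of what one entry in a customer hub might hold. All names and identifiers are hypothetical; the essential ingredients are an enterprise-wide identifier, the agreed defining attributes, and cross-references to each source system's local key.

```python
# A master reference entry: one enterprise-wide ID, agreed attributes, and
# a cross-reference from every source system's local key to that ID.
from dataclasses import dataclass, field

@dataclass
class MasterCustomer:
    master_id: str                                   # enterprise-wide identifier
    legal_name: str                                  # the agreed "customer name"
    attributes: dict = field(default_factory=dict)   # shared defining attributes
    source_keys: dict = field(default_factory=dict)  # system name -> local key

hub_entry = MasterCustomer(
    master_id="CUST-000042",
    legal_name="Acme Manufacturing Inc.",
    attributes={"segment": "industrial", "country": "US"},
    source_keys={"billing": "A-7731", "crm": "0009912", "erp": "400128"},
)

# Any application can now translate its local key into the enterprise view.
local_to_master = {
    (system, key): hub_entry.master_id
    for system, key in hub_entry.source_keys.items()
}
print(local_to_master[("crm", "0009912")])  # CUST-000042
```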

Clearly, different master data is important in different industries and different organizations, but that said, the importance of "customer" and "product" master data hubs far exceeds all the others. Indeed, customer and product hubs have become important enough to be considered specific software markets all by themselves.

MDM Models

There are a number of different forms of MDM models. The most prominent are hub-based MDM models and data warehouse–based MDM models.

Hub-Based MDM Models

From a marketing point of view, customer and product hubs are the MDMs that are getting the most interest. Large enterprises are vitally interested in understanding and controlling customer and product data at an enterprise level, and all the major players (IBM, Oracle, SAP) have announced such hubs. Hubs are specific applications provided by various vendors that support a specific major actor or object.

These hubs aim to provide organizations with a common place to find all of the data about a specific semantic category. These hubs differ from one another in some interesting ways. In the case of the largest players, the hub vendors provide a base set of reference metadata for an all-encompassing customer or product hub. Armed with this set of reference data, organizations can modify the reference data and then use the modified hub reference data as the target for building (or extending) the links to their operational data and/or their already developed data warehouses.

In some other cases, the data hub comes more or less prebuilt. Often, these hubs have been developed by COTS vendors who specialize in specific marketplaces and provide links with their COTS operational data. In other cases, hub vendors come from specific industries (retail, auto, pharmaceuticals, manufacturing) and exploit their experience to provide predefined master reference data.

Data Warehouse–Based MDM Models

Another approach to MDM is to develop custom hubs based on existing data warehouses and BI initiatives. This makes sense, since the biggest obstacle in developing a serious MDM hub is identifying, finding, cleaning, and integrating data from operational and external data sources. If an organization has already spent multiple years and millions of dollars integrating data from a number of operational and external sources into an enterprise or divisional warehouse, it is natural that the next evolutionary step (i.e., developing an MDM hub) would be to build on these existing investments.

Basically, just as there are two fundamental data warehouse design strategies (data warehouse–oriented and data mart–oriented), we are also seeing two major data warehouse–based MDM models: (1) MDMs built around core data warehouses/ODSs and (2) MDMs built around data marts using star schema models.

ODS-Based MDM Models

One of the most interesting approaches for getting from data warehousing to MDM is found in a set of ideas that uses the metadata of an ODS as the basis for creating custom master data hubs (see Figure 4). One ODS-based approach has been developed by ObjectRiver [4]. The strategy involves analyzing entity-relationship (ER) information from an ODS to identify "coarse-grained business objects"21 by looking for the natural hierarchies within the data model. This is done by first identifying those entities with just primary keys (unique identifiers) and then looking at the relationships of the tables that are directly attached to those primary-key-only tables. In a way, the approach is somewhat equivalent to picking up the ER diagram of the ODS by one of these key entities and seeing what is attached.

21 In [4], Steven Lemmo writes: "We call these 'coarse grained' because they are aggregations of entities that are linked through defined relationships."
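A rough sketch of that hierarchy hunt follows. This is my own interpretation of the idea described above, not ObjectRiver's actual algorithm, and the ER metadata is invented: tables with no foreign keys are treated as candidate roots, and the tables directly attached to each root form its coarse-grained business object.

```python
# Find candidate root entities (tables with only their own primary key, no
# foreign keys) and gather the tables directly attached to each root.
from collections import defaultdict

# Hypothetical ER metadata: table name -> tables it references via foreign keys.
foreign_keys = {
    "customer":   [],                    # primary key only -> candidate root
    "product":    [],                    # primary key only -> candidate root
    "address":    ["customer"],
    "order":      ["customer"],
    "order_line": ["order", "product"],
}

roots = [table for table, fks in foreign_keys.items() if not fks]

attached = defaultdict(set)
for table, fks in foreign_keys.items():
    for referenced in fks:
        attached[referenced].add(table)

# "Pick up" the ER diagram by each root and see what hangs off it.
for root in roots:
    print(root, "->", sorted(attached[root]))
# customer -> ['address', 'order']
# product -> ['order_line']
```

A fuller version would follow the attachments transitively (order_line hangs off customer via order), which is where the natural hierarchies in the model emerge.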

One advantage of this approach is that there are no major data conversion projects involved, since the data already exists, which means that organizations can start getting value sooner. Another advantage is that it yields a wider range of master data subjects than one would find by just focusing on customer or product hubs. Finally, the approach works well with SOA, since it can automatically build a "data adapter" to facilitate OO access to existing relational data.

[Figure 4 — ODS-based MDM model. ETL applications feed data from operational databases, COTS databases, and external data sources into an operational data store (ODS); data hub population applications then build the master data hubs that support BI applications.]

Data Mart–Based MDM Models

Another strategy for developing MDM models is to leverage existing data mart–based data warehouse projects to develop MDM hubs. The popularity of Kimball's star schema structures for data warehouse design makes them another natural starting point for organizations that want to develop their own custom data hubs. The problem here is a little more complex, since the focus of star schema designs is the fact tables and dimension tables, where fact tables usually represent base transactions (messages) and dimensions represent actors, objects, events (time), and locations. This means that customer data (which is naturally a dimension) is apt to be found in many data marts, connected to a number of different fact tables. Rationalizing the information here is somewhat more difficult than in the case where there is a single staging area with a single set of metadata to analyze. A number of organizations that specialize in combining metadata from different data marts and data warehouses are focusing on this area.
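The sketch below illustrates the rationalization problem on a toy scale. The schemas, keys, and matching rule are all invented: two marts carry their own copy of the customer dimension under different local keys, and a custom hub has to line the copies up before it can merge their attributes.

```python
# Two data marts each carry a customer dimension with its own local key.
# Reconciling them means finding a shared attribute to match on and then
# merging the attributes into one candidate master dimension.
import pandas as pd

sales_mart_dim = pd.DataFrame({
    "cust_key": [1, 2],
    "name":     ["Acme Inc.", "Globex"],
    "region":   ["MW", "NE"],
})
service_mart_dim = pd.DataFrame({
    "customer_no": ["A-7731", "G-1102"],
    "name":        ["Acme Inc.", "Globex Corp."],
    "tier":        ["gold", "silver"],
})

# Crude match rule for illustration: first word of the customer name.
sales_mart_dim["match_name"] = sales_mart_dim["name"].str.split().str[0]
service_mart_dim["match_name"] = service_mart_dim["name"].str.split().str[0]

master_dim = sales_mart_dim.merge(
    service_mart_dim, on="match_name", suffixes=("_sales", "_service")
)
print(master_dim[["match_name", "region", "tier"]])
```

With a single ODS there is one set of metadata to do this against; with many marts, the matching has to be repeated for every pair of dimension copies, which is exactly the difficulty noted above.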

MDM Issues

Most MDM hubs focus on either one of the primary business actors (customer, trading partner) or one of the primary business objects (product, service). By bringing this data together, organizations gain much more control over their decision making and process management. The problem is that the "glue" that holds data together is not the business actors or business objects; it is the messages or transactions. In an organization with both a customer hub and a product hub, where do the "orders" reside? The answer, of course, is that they reside in both places. Orders, shipments, bills, and payments all tie enterprises to their customers and their products. The same is true with vendors and purchased products — it is the purchase orders, receipts, vendor invoices, and vendor payments that tie them together.
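A tiny sketch (identifiers invented) of that point: the order is the record that carries keys into both hubs, so it can be read from either hub's point of view but belongs naturally to neither.

```python
# An order is "glue": it references the customer hub AND the product hub.
orders = [
    {"order_id": "O-9001", "master_customer": "CUST-000042",
     "master_product": "PROD-0117", "qty": 12},
    {"order_id": "O-9002", "master_customer": "CUST-000042",
     "master_product": "PROD-0090", "qty": 3},
]

# The same transactions answer questions posed from either hub's viewpoint.
by_customer = [o for o in orders if o["master_customer"] == "CUST-000042"]
by_product  = [o for o in orders if o["master_product"] == "PROD-0117"]
print(len(by_customer), len(by_product))  # 2 1
```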

Ultimately, I expect that MDM will simply move our organizations to the next plateau of data architecture, where we begin to integrate the data from our MDM hubs to create a true enterprise data architecture.

Another plus for MDM hubs is the increasing interest in using them as a point of integration for all the unstructured data we have about our customers and products. This data is growing rapidly and is enormously fragmented. Bringing it together under a common management strategy will have enormous long-range benefits for our organizations.

CONCLUSION

The key to better decisions and business process improvement is better data!

Of all the elements of EA, data architecture is both the most important and the most mature, in no small part because data is our most valuable enterprise IT asset. Over the last two decades, data architecture in large and medium-sized enterprises has undergone dramatic changes. The most significant of these changes has been the introduction of data warehousing as a means of "rationalizing" databases specifically designed for data access as opposed to operations. This has reenergized business intelligence.

Data architecture is now undergoing yet another major change: master data management. MDM focuses enterprise data management efforts around the corporate jewels, the data associated largely with major actors and objects like customer and product (i.e., the critical data that is most important from a strategic standpoint). These data hubs are increasingly comprehensive and increasingly sophisticated, incorporating data not only from most of the necessary enterprise sources but from external, syndicated sources as well.

Data is the lifeblood of big organizations. The bigger the enterprise, the more important data becomes. And the more electronic and real-time organizations become, the more significant data architecture, data warehousing, and MDM become. While there are no magic bullets, these are the initiatives that leading enterprises are depending on to manage the future.


REFERENCES

1. Codd, E.F. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, Vol. 13, No. 6, June 1970.

2. Helland, Pat. "Metropolis." Microsoft Architect Journal, Microsoft Press, Vol. 2, April 2004.

3. Hohmann, Luke, and Ken Collier. "Do You Run from or to Embedded Business Intelligence?" Cutter Consortium Business Intelligence Executive Report, Vol. 6, No. 12, 2006.

4. Lemmo, Steven. "Leveraging Data Models to Create a Unified Business Vocabulary for SOAs." DM Review, 25 January 2007.

5. Lewis, Grace Alexandra, Santiago Comella-Dorda, Pat Place, Daniel Plakosh, and Robert C. Seacord. "An Enterprise Information System Data Architecture Guide." Technical Report CMU/SEI-2001-TR-018, Carnegie Mellon Software Engineering Institute, 2001.

6. Lyman, Peter, and Hal R. Varian. "How Much Information? 2003." University of California at Berkeley, October 2003.

ABOUT THE AUTHOR

Ken Orr is a Fellow of the Cutter Business Technology Council and a Senior Consultant with Cutter Consortium's Agile Project Management, Business Intelligence, Business-IT Strategies, and Enterprise Architecture practices. He is also a regular speaker at Cutter Summits and symposia. Mr. Orr is the founder of and Chief Researcher at the Ken Orr Institute, a business technology research organization. Previously, he was an Affiliate Professor and Director of the Center for the Innovative Application of Technology with the School of Technology and Information Management at Washington University. He is an internationally recognized expert on technology transfer, software engineering, information architecture, and data warehousing. Mr. Orr has more than 30 years' experience in analysis, design, project management, technology planning, and management consulting. He is the author of Structured Systems Development, Structured Requirements Definition, and The One Minute Methodology. He can be reached at [email protected].
