Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
2
Part 1:
Data Warehouse Concepts
3
Data Management
� A critical success factor: IT applications cannot be done without using data.
� The Difficulties of managing Data: • The amount of data increases exponentially with time
• Data are scattered throughout organization
• An ever-increasing amount of external data needs
• Data security, quality, and integrity are critical
4
Data Life Cycle
5
Data Sources
� Internal Data Sources: data about people,
products, services, and processes.
� Personal Data: IS users or other corporate
employees may document their own expertise
by creating personal data.
� External Data Sources: Data from commercial
databases to sensors and satellites.
6
Data, Data everywhere yet ...
� I can’t find the data I need
� data is scattered over the network
� many versions, subtle differences
� I can’t get the data I need
� need an expert to get the data
� I can’t understand the data I found
� available data poorly documented
� I can’t use the data I found
� results are unexpected
� data needs to be transformed from one
form to other
7
What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety of
different sources made
available to end users in a
what they can understand and
use in a business context.
[Barry Devlin]
8
Which are ourlowest/highest margin
customers ?
Which are ourlowest/highest margin
customers ?
Who are my customers and what products are they buying?
Who are my customers and what products are they buying?
Which customersare most likely to go to the competition ?
Which customersare most likely to go to the competition ?
What impact will new products/services
have on revenue and margins?
What impact will new products/services
have on revenue and margins?
What product prom--otions have the biggest
impact on revenue?
What product prom--otions have the biggest
impact on revenue?
What is the most effective distribution
channel?
What is the most effective distribution
channel?
Why Data Warehousing?
9
Decision Support
� Used to manage and control business
� Data is historical or point-in-time
� Optimized for inquiry rather than update
� Use of the system is loosely defined and can
be ad-hoc
� Used by managers and end-users to
understand the business and make judgements
10
Evolution of Decision Support
� 60’s: Batch reports
� hard to find and analyze information
� inflexible and expensive, reprogram every request
� 70’s: Terminal based DSS and EIS
� 80’s: Desktop data access and analysis tools
� query tools, spreadsheets, GUIs
� easy to use, but access only operational db
� 90’s: Data warehousing with integrated OLAP
engines and tools
11
Definition of data warehousing
According to W.H.Inmon
A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of management’s
decision making process.
12
Characteristics of a Data Warehouse Subject-Oriented. Data are organized by subject and contain
information relevant for decision support only .
Consistency. Data in different operational databases may be encoded differently . In the data warehouse, though, they will be coded in a consistent manner.
Time variant. The data are kept for many years so that they can be used for trends, forecasting, and comparisons over time.
Non-volatile. Data are not updated once entered into the warehouse.
Multidimensional. Typically the data warehouse uses a multidimensional structure .
Web-based. Today’s data warehouse are designed to provide an efficient computing environment for web-based applications.
� often a copy of operational data
� with value-added data (e.g., summaries, history)
13
Why a Warehouse?
� Two Approaches:
� Query-Driven (Lazy)
� Warehouse (Eager)
Source Source
?
14
Query-Driven Approach
Client Client
Wrapper Wrapper Wrapper
Mediator
Source Source Source
15
Warehouse Architecture
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
16
Advantages of Warehousing
� High query performance
� Queries not visible outside warehouse
� Local processing at sources unaffected
� Can operate when sources unavailable
� Can query data not stored in a DBMS
� Extra information at warehouse
� Modify, summarize (store aggregates)
� Add historical information
17
Advantages of Query-Driven
� No need to copy data
� less storage
� no need to purchase data
� More up-to-date data
� Only query interface needed at sources
18
OLTP vs. OLAP
� OLTP: On Line Transaction Processing
� Describes processing at operational sites
� OLAP: On Line Analytical Processing
� Describes processing at warehouse
19
OLTP vs. OLAP
� Mostly updates
� Many small transactions
� Mb-Tb of data
� Raw data
� Clerical users
� Up-to-date data
� Consistency,
recoverability critical
� Mostly reads
� Queries long, complex
� Gb-Tb of data
� Summarized, consolidated
data
� Decision-makers, analysts
as users
OLTP OLAP
20
OLTP vs. OLAP OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date detailed, flat relational isolated
historical, summarized, multidimensional integrated, consolidated
usage repetitive ad-hoc
access read/write index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
21
Data Marts
Data Mart: A small data warehouse designed for a
strategic business unit ( SBU) or a department
The advantage of data marts include: low cost
(Prices under $100,000 versus $1million or more for
data warehouses); significantly shorter lead time for
implementation (often less than 90 days), local rather
than central control (conferring power on the using
group), More rapid response and more easily
understood and navigated than an enterprise wide
data warehouse .
22
Part 2:
Data Warehouse Design
23
Building a Data Warehouse
24
Relational and Multidimensional Database
� Relational databases store data in two –
dimensional tables. Multidimensional
databases typically store data in arrays, which
consist of at least three business dimension.
25
Relational data model
� based on a single structure of data values in a two
dimensional table
CUSTOMER ORDER
………
…Lyn002
…Robert001
…Cus_nameCus_id
…
03 Dec 02
02 Dec 02
Ord_date
………
…Lyn02
…00201
…Cus_idOrd_no
26
A Sample Data CubeTotal annual sales
of TV in U.S.A.Date
Pro
duct
Cou
ntr
y
sum
sumTV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
27
Cuboids Corresponding to the Cubeall
product date country
product,date product,country date, country
product, date, country
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D(base) cuboid
• Cuboids show the data at different degrees of summarization.
• Given a set of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization, or group by. The lattice of cuboids is then referred to as a data cube.
28
Multidimensional Data Model
� Composed of one fact table and a set of dimension tables.
� Dimensional table: each dimension table has a simple table (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.
� A multidimensional data model is typically organized around a central theme, like sales, for instance.
29
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
� Star schema: A fact table in the middle connected to a set
of dimension tables
� Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
� Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
30
Example of Star Schematime_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_street
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
31
Example of Snowflake Schematime_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
item
branch_key
branch_name
branch_type
branch
supplier_key
supplier_type
supplier
city_key
city
province_or_street
country
city
32
Example of Fact Constellation
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_street
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_key
shipper_name
location_key
shipper_type
shipper
33
Typical OLAP Operations� Roll up (drill-up): summarize data
� by climbing up hierarchy or by dimension reduction
� Drill down (roll down): reverse of roll-up
� from higher level summary to lower level summary or detailed data,
or introducing new dimensions
� Slice and dice:
� project and select
� Pivot (rotate):
� reorient the cube, visualization, 3D to series of 2D planes.
� Other operations
34
Operations
� Rollup: summarize data
� e.g., given sales data, summarize sales for last
year by product category and region
� Drill down: get more details
� e.g., given summarized sales as above, find
breakup of sales by city within each region, or
within the Andhra region
35
More Cube Operations
� Slice and dice: select and project
� e.g.: Sales of soft-drinks in Andhra over the last
quarter
� Pivot: change the view of data
36
Cube
sale prodId storeId amt
p1 c1 12
p2 c1 11
p1 c3 50
p2 c2 8
c1 c2 c3
p1 12 50
p2 11 8
Fact table view:Multi-dimensional cube:
dimensions = 2
37
3-D Cube
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
dimensions = 3
Multi-dimensional cube:Fact table view:
38
Another Example
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
• Add up amounts by day, product• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale prodId date amt
p1 1 62
p2 1 19
p1 2 48
drill-down
rollup
39
Pivoting
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
Multi-dimensional cube:Fact table view:
c1 c2 c3
p1 56 4 50
p2 11 8
40
Design a Warehouse?
� Design data warehouse
� Design data marts
� Design representation (Star schema, …)
� gathering data
� Which data is needed?
� Where does it come from?
� cleansing, integrating, ...
� querying, reporting, analysis
� data mining
� monitoring, administering warehouse
41
Data Gathering
� Periodic snapshots
� Database triggers
� Log shipping
� Data shipping (replication service)
� …
42
Integration
� Data Cleaning
� Data Loading
� Derived DataClient Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
43
Data Cleaning
� Migration (e.g., yen � dollars)
� Fusion (e.g., mail list, customer merging)
billing DB
service DB
customer1(Joe)
customer2(Joe)
merged_customer(Joe)
44
Loading Data
� Incremental vs. refresh
� Off-line vs. on-line
� Frequency of loading
� At night, 1x a week/month, continuously
� Parallel/Partitioned load
45
Derived Data
� Derived Warehouse Data
� When to update derived data?
46
Part 3:
Data Mining
47
Data Mining Concepts
Data mining: The process of searching for
valuable business information in a large
database, data warehouse, or data mart.
48
Data Mining Application
Retailing and sales
Banking
Manufacturing and production
Insurance
Police work
Health care
Marketing
49
Text Mining
The application of data mining to non-
structured or less-structured text files.
50
Web Mining
The application of data mining techniques to
discover actionable and meaningful patterns form
web resources.
Web mining is used in the following areas:
information filtering, surveillance, mining of web-
access logs for analyzing usage.
51
Some basic operations
� Predictive:
� Regression
� Classification
� Descriptive:
� Clustering / similarity matching
� Association rules and variants
52
Classification
� Given old data about customers and
payments, predict new applicant’s loan
eligibility.
Age
Salary
Profession
Location
Customer type
Previous customers Classifier Decision rules
Salary > 5 L
Prof. = Exec
New applicant’s data
Good/
bad
53
Classification methods
Goal: Predict class Ci = f(x1, x2, .. Xn)
� Regression: (linear or any other polynomial)
� a*x1 + b*x2 + c = Ci.
� Nearest neighbor
� Decision tree classifier
� Neural networks
54
� Tree where internal nodes are simple
decision rules on one or more attributes and
leaf nodes are predicted class labels.
Decision trees
Salary < 1 M
Prof = teacher
Good
Age < 30
BadBadGood
55
Neural network
� Set of nodes connected by directed weighted
edges
Hidden nodes
Output nodes
x1
x2
x3
x1
x2
x3
w1
w2
w3
y
n
i
ii
ey
xwo
−
=
+
=
= ∑
1
1)(
)(1
σ
σ
Basic NN unitA more typical NN
56
Bayesian learning
� Assume a probability model on generation of data.
� Apply Bayes theorem to find most likely class as:
)(
)()|(max)|(max :class predicted
dp
cpcdpdcpc
jj
cj
c jj
==
57
Clustering
� Unsupervised learning when old data with class
labels not available e.g. when introducing a new
product.
� Group/cluster existing customers based on time
series of payment history such that similar
customers in same cluster.
58
What is association rule mining?
Hmmm, which items are frequently
purchased together by my customers?
milkcereal
breadmilk
bread
butter
milk bread
sugar eggs
Customer 1
Market Analyst
Customer 2
sugareggs
Customer n
Customer 3
Shopping Baskets 100100Basket 4
001011Basket 3
100111Basket 2
010011Basket 1
eggscerea
l
butte
r
suga
r
brea
d
milk
59
What is association rule mining? (cont.)
100100Basket 4
001011Basket 3
100111Basket 2
010011Basket 1
eggscerealbuttersugarbreadmilk
211233count
Support (milk)=3
Support (bread)=3
Support (sugar)=2
……
Support (milk U bread)=3
Support (milk U sugar)=1
……
Support (milk U bread U sugar)=1
……
Support (milk U bread U sugar U butter U cereal U eggs)=0
Confidence (A → B)=Support (A U B)/Support (A)
As Confidence (milk → bread) =
= Support (milk U bread)/Support (milk) = 3/3 = 100%,
Then milk → bread
If Confidence (A → B) >= min_conf, Then A → B
60
How DM improve your business?
Strategy 1: Placing milk
and bread within close
proximity may further
encourage the sale of
these items together
within single visits to
the store.
61
How DM improve your business?
Strategy 2: Placing milk and bread at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.
62
How DM improve your business?
Strategy 3:Put these two
items into a package
at reduced price.
63
Stage in the evolution of knowledge discovery
Proactive , integrative ;
multiple business partners
Neural computing
advanced al models,
complex optimization,
web services
What is the best plan to
follow? how did we
perform compared to
metrics?
Advanced intelligent
systems; complete
integration(2000-2004)
Prospective , proactive
information delivery
Advanced algorithms,
multiprocessor computers,
massive databases
What’s likely to happen to
the tBoston unit’s sales
next month ? Why?
Intelligent data mining
(late 1990s)
Retrospective , proactive
data delivery at multiple
level
OLAP, multidimensional
databases, data
warehouses
What were the sales in
region A by product , by
salesperson?
Data warehousing and
decision support (early
1990s)
Retrospective , dynamic
data delivery at record
level
Relational databases
(RDBMS), structured
query language (SQL)
What were unit sales in
new England last March ?Data access (1980s)
Retrospective , static data
delivery
Computers ,tapes , disks What was my total
revenue in the last 5
years?
Data collection(1980s)
Business question enabling technologies characteristic Evolutionary stage