Jef Wijsen Data Warehousing and Data Mining 1
Data Warehousing and Data Mining
Jef Wijsen
Université de Mons-Hainaut (UMH)
Service de Science des Systèmes d’Information
http://staff.umh.ac.be/Wijsen.Jef/
March 19, 2005
• 1 Situation
• 2 Data Warehouse
• 3 Testimony
• 4 OLAP
• 5 Data Mining
• 6 Next Generation Data Warehousing/Mining
• 7 Important Players
• 8 Selected Literature
1 Situation
• R&D in OnLine Transaction Processing (OLTP) since the sixties has resulted in (relational) database systems.
• Digitizing and storing data is simple and cheap. E.g., bar codes.
• Huge amounts of historical and operational data may hide nuggets of information (rules, trends, patterns, ...) on the business.
• New challenge: disclose previously unknown knowledge so that it can be used by managers.
Case Study
Borrowed from www.internetweek.com.
Sports car owners fall into a high-risk category, in the conventional wisdom of auto insurance underwriters.
Knowledge Discovery. But by mining driver safety data in its new data warehouse, Farmers Insurance Group has found that if sports car enthusiasts also own a second, conventional car, they may be safe-enough drivers to be attractive as policyholders.
Effective Use. “We found a microniche among all sports car owners,” said Tom Boardman, an assistant actuary at Farmers [...]. “As a result, we changed how we underwrite and price some sports car policies,” he said.
Business Opportunities
Banking. Which prospects are most likely to become profitable customers?
Retail. What will this customer want next?
Government tax agency. Which tax returns are likely to be non-compliant?
Government intelligence agency. What specific event is most likely to be a security threat?
B. Thuraisingham: Web Data Mining and Applications in Business Intelligence and Counter-Terrorism
The Focus of the Talk is on Technological Issues
Three new technological developments in the field of decision support systems:
Data warehousing. OLTP data are often scattered over different systems, highly detailed, and/or of poor quality. Data warehousing involves integrating, aggregating/summarizing, and cleaning these data in a new data repository, called a data warehouse.
OnLine Analytical Processing (OLAP). Online analysis of the data warehouse content; data is represented in multidimensional spreadsheets, called data cubes.
Data mining. Exploring data in search of interesting, new knowledge (rules, trends, regularities, patterns, ...).
OLAP versus Data Mining
OLAP. User-driven hypothesis verification. The analyst posing the query usually tells the system exactly what query to execute; i.e., on which portion of the data to focus.
“Give the monthly number of customers that left in the past year.”
Data mining. Data-driven hypothesis building. A data-mining query goes a step beyond, inviting the system to decide where the focus should be.
“What factors affect the loss of customers?”
Knowledge Discovery in Databases (KDD)
KDD ≈
data warehousing (integration, aggregation, cleaning)
+
OLAP
+
data mining
OLAP versus OLTP
           OLTP                          OLAP
end-user   clerk                         manager
workload   frequent transactions         regular analyses
access     read and write,               mostly read only,
           a limited number of records   scanning millions of records
data       current                       current and historical
DB size    100 MB to GB                  100 GB to TB (= 10^6 MB)
These differences constitute an additional argument for building a data warehouse separate from existing transactional databases.
2 Data Warehouse
2.1 What is a Data Warehouse?
A data repository for decision support, with the following characteristics:
Subject-oriented and integrated. OLTP data are often scattered over multiple applications (invoicing, delivery, production, ...). Data is integrated in a data warehouse around a number of subjects (client, product, supplier, ...).
Non-volatile and historical. Data, once entered in the warehouse, is not subject to change. Data covers a certain period of time (e.g., ten years) in order to allow trend analysis.
2.2 What is a Data Mart?
A departmental data warehouse focusing on a specific part of the business. E.g., a marketing data mart about the subjects client, product, and sales.
Two types of data marts:
Data mart without data warehouse. These data marts can be realized more easily as they do not require a business-wide conceptual data model. However, they can raise complex integration problems in the long run.
Data mart extracted from the data warehouse. For reasons of increased flexibility and performance.
2.3 Constructing a Data Warehouse
Extraction. Extracting data from transactional databases and other data repositories. Typically an overnight batch process.
Cleaning. GIGO principle (Garbage In, Garbage Out)...
• completing missing values and NULLs,
• correcting typos and other errors,
• unifying synonyms,
• ...
Data that are obviously erroneous but cannot be corrected are removed.
Integration and transformation. Fusing different data sources.
• Matching entity identifications, e.g., client_id and cl_nr.
• Unifying data expressed along different scales, e.g., BEF and EUR.
• Translating addresses into coordinates.
• Aggregating individual sales into daily sales figures.
• Normalizing variables between 0 and 1.
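Two of the transformations listed above, unifying currencies and normalizing a variable between 0 and 1, can be sketched as follows. This is only an illustration: the function names and the sample amounts are hypothetical, and min-max scaling is one common way to normalize, not necessarily the one a given warehouse uses. The rate of 40.3399 BEF per euro is the fixed conversion rate.

```python
BEF_PER_EUR = 40.3399  # fixed BEF/EUR conversion rate


def to_eur(amount, currency):
    """Express an amount given in BEF or EUR in euros."""
    if currency == "EUR":
        return amount
    if currency == "BEF":
        return amount / BEF_PER_EUR
    raise ValueError(f"unknown currency: {currency}")


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them all to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sales amounts recorded in two different currencies.
sales = [(403.399, "BEF"), (25.0, "EUR"), (40.0, "EUR")]
in_eur = [to_eur(amount, currency) for amount, currency in sales]
normalized = min_max_normalize(in_eur)
```

Converting before normalizing matters here: scaling the raw numbers would treat 403.399 BEF as the largest amount, although it is the smallest once expressed in euros.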
Load and refresh
• Loading the data into the warehouse involves creating indexes to speed up queries.
• Modifications in OLTP databases are propagated regularly to the data warehouse (copy management).
[Figure: data warehouse architecture. Source databases (DB) pass through Clean, Integrate, Transform, and Load into the Data Warehouse, which is described by Metadata and serves OLAP and Data Mining.]
3 Testimony
H. Van Puyvelde. De l’information opérationnelle à l’intelligence décisionnelle par le data mining: Étude de faisabilité appliquée au cas d’un service public. Master’s thesis, Université de Mons-Hainaut, 2000.
The company uses a dozen important applications on four different DBMS platforms.
Initial challenge: Apply data mining to answer questions like
• “Who are our clients?”
• “Which services are most beneficial to our clients?”
However, thorough data preparation was mandatory...
Some difficult quality problems:
• Double registration of the same entity. E.g., 〈RAYTEC, Rue de Commerce 2, ...〉 and 〈S.A. RAYTEC, 2 Rue de Commerce, ...〉.
• Multiple use of the catch-all code “others” for attributes like skills or profession.
• Missing, impossible, or outdated attribute values.
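To make the double-registration problem concrete, here is a deliberately crude sketch that maps the two RAYTEC records above to the same matching key. The list of legal-form tokens and the function name are assumptions made for this illustration; real record matching usually also needs fuzzy comparison, which this sketch does not attempt.

```python
import re

# Assumed list of legal-form abbreviations to ignore when comparing names.
LEGAL_FORMS = {"sa", "nv", "sprl", "bvba"}


def match_key(name, address):
    """Build an order-insensitive key from a company name and an address."""
    text = f"{name} {address}".lower().replace(".", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    # Drop legal forms and ignore token order (street number before or after).
    return frozenset(t for t in tokens if t not in LEGAL_FORMS)


k1 = match_key("RAYTEC", "Rue de Commerce 2")
k2 = match_key("S.A. RAYTEC", "2 Rue de Commerce")
same_entity = k1 == k2
```

A key this coarse can also merge records that are genuinely different, so in practice candidate matches found this way would still need review.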
This confirms others’ experiences:
“Preparation of the data [...] can easily take up to 80% of the time needed for the whole KDD [Knowledge Discovery in Databases]; this is not surprising, since the difficulties in data integration are well known.”
[Mannila 96]
4 OLAP
4.1 Data Cube
Typically, OLAP analyses are based on summary reports, e.g., the daily sales amounts by store and product.
The data can be naturally represented in a “data cube”:
• The cube dimensions correspond to the independent variables, e.g., day, store, and product.
• The cube cells contain the corresponding values for the dependent variable, e.g., the number of pieces sold.
OLAP software provides several ways of visualizing data cubes.
[Figure: A 3-dimensional data cube with dimensions day (1 Jan 2001, 15 Jan 2001, 1 Feb 2001, 15 Feb 2001), store (Kinderdroom, Navona, Cremona), and product (Lego, Scrabble).]
Concept Hierarchies
Typically, dimensions are organized in concept hierarchies that determine logical ways of grouping data.
• day → month → year
• store → region
• product → class
4.2 Rollup: A Typical OLAP Query
Rollup queries provide for each dimension the level at which data is to be presented.
“Give total sales amounts by product, region, and month”.
[Figure: The cuboid month-region-product, with regions Belgium and Italy, products Lego and Scrabble, and months Jan 2001 and Feb 2001.]
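The rollup query above ("total sales amounts by product, region, and month") can be sketched in a few lines. The fact rows and their counts are hypothetical, and placing Navona and Cremona in Italy is an assumption; the slides only state that Kinderdroom is in Belgium.

```python
from collections import defaultdict

# Hypothetical daily fact rows: (day, store, product, pieces sold).
facts = [
    ("2001-01-01", "Kinderdroom", "Lego", 46),
    ("2001-01-15", "Kinderdroom", "Lego", 44),
    ("2001-01-01", "Navona", "Lego", 33),
    ("2001-01-01", "Cremona", "Lego", 28),
    ("2001-02-01", "Kinderdroom", "Scrabble", 50),
]

# Concept hierarchy store -> region (the Italian stores are an assumption).
REGION = {"Kinderdroom": "Belgium", "Navona": "Italy", "Cremona": "Italy"}


def month_of(day):
    """Map an ISO date string to its month, e.g. '2001-01-15' -> '2001-01'."""
    return day[:7]


def rollup(rows):
    """Aggregate daily, store-level counts to (month, region, product)."""
    totals = defaultdict(int)
    for day, store, product, count in rows:
        totals[(month_of(day), REGION[store], product)] += count
    return dict(totals)


cuboid = rollup(facts)
```

Dropping a dimension entirely, as in "over all stores", amounts to leaving that component out of the grouping key.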
OLAP queries can reduce the number of dimensions.
“Give total sales figures by product and month, over all stores.”
[Figure: The cuboid month-product, with products Lego and Scrabble and months Jan 2001 and Feb 2001.]
“Pre-materializing” Cuboids
[Figure: lattice of the 36 cuboids. Each cuboid picks one level per dimension or drops the dimension: day, month, year, or none; store, region, or none; product, class, or none. The cuboids range from "day store product" (most detailed) to the empty cuboid (grand total).]
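The cuboids above can be enumerated mechanically: each one chooses a level for every dimension, or aggregates the dimension away. A small sketch, in which the dimension names `time`, `location`, and `item` are labels invented for this illustration:

```python
from itertools import product

# Levels per dimension; None means the dimension is aggregated away.
HIERARCHIES = {
    "time": ["day", "month", "year", None],
    "location": ["store", "region", None],
    "item": ["product", "class", None],
}


def all_cuboids(hierarchies):
    """List every cuboid as a tuple of the levels that remain visible."""
    combos = product(*hierarchies.values())
    return [tuple(level for level in combo if level is not None) for combo in combos]


cuboids = all_cuboids(HIERARCHIES)
```

With 4, 3, and 3 choices per dimension this yields 4 × 3 × 3 = 36 cuboids, which is why deciding which of them to pre-materialize becomes a real trade-off as hierarchies grow.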
4.3 Technological Choice: ROLAP or MOLAP
Technological challenges in OLAP:
efficient support of spreadsheet operations on databases of multiple gigabytes.
Depending on the technology used, one can classify OLAP software into two categories:
• ROLAP (Relational OLAP), or
• MOLAP (Multidimensional OLAP).
ROLAP
• The data cube is stored in relational tables, in a so-called “star schema.”
• Relational database servers are extended with specialized middleware for OLAP support. E.g., Microsoft SQL Server OLAP Services.
• The relational query language SQL is extended with special OLAP primitives.
Star Schema
Fact table:
day         store        product   count
1 Jan 2001  Kinderdroom  Lego      46
1 Jan 2001  Kinderdroom  Scrabble  50
...         ...          ...       ...
1 Jan 2001  all          Lego      115
...         ...          ...       ...
Dimension tables:
product  class
Lego     toys
...      ...

store        region
Kinderdroom  Belgium
...          ...

day  month  year
...  ...    ...
[Figure: arrows link the day, store, and product columns of the fact table to the corresponding dimension tables.]
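As a rough illustration of how a ROLAP system answers a rollup from such tables, the sketch below builds a tiny version of them in an in-memory SQLite database and joins the fact table with two dimension tables. The table names, the column name `pieces` (used instead of `count` to avoid confusion with the SQL function), and the class of Scrabble are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_dim (store TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE product_dim (product TEXT PRIMARY KEY, class TEXT);
    CREATE TABLE sales_fact (day TEXT, store TEXT, product TEXT, pieces INTEGER);
    INSERT INTO store_dim VALUES ('Kinderdroom', 'Belgium');
    -- Scrabble's class is assumed; only Lego's appears in the table above.
    INSERT INTO product_dim VALUES ('Lego', 'toys'), ('Scrabble', 'toys');
    INSERT INTO sales_fact VALUES
        ('2001-01-01', 'Kinderdroom', 'Lego', 46),
        ('2001-01-01', 'Kinderdroom', 'Scrabble', 50);
""")

# Roll up from (store, product) to (region, class) by joining the dimensions.
rows = conn.execute("""
    SELECT s.region, p.class, SUM(f.pieces)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store = f.store
    JOIN product_dim AS p ON p.product = f.product
    GROUP BY s.region, p.class
""").fetchall()
```

The fact table stays narrow (keys plus the measure), while the hierarchy levels live in the small dimension tables.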
MOLAP
• MOLAP uses multidimensional databases storing data in (sparse) matrices.
• This way of storing data may be more efficient than in ROLAP.
• A drawback is that integration with existing SQL databases is more difficult.
5 Data Mining
• In OLAP, the end user guides the analysis:
1. choice of dimensions and dependent variables, and
2. specification of queries.
• Problem: the content of the data warehouse is often not well understood, so that it is nearly impossible to select the right data cube and ask the right questions.
• Starting point of data mining: use computer power to discover interesting patterns from databases, rather than verifying hypotheses.
E.g., credit scoring. Using historical warehouse data on overdue credits, a data mining program may discover the following rule:
If income ≤ 20,000 euros and seniority ≤ 5 years
then risk = high, else risk = low
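Once discovered, such a rule is trivial to apply to new cases. A minimal sketch of the rule above, where the function and parameter names are chosen for this illustration:

```python
def credit_risk(income_eur, seniority_years):
    """Apply the example rule: low income and short seniority mean high risk."""
    if income_eur <= 20_000 and seniority_years <= 5:
        return "high"
    return "low"


applicants = [(18_000, 2), (18_000, 10), (35_000, 1)]  # hypothetical cases
risks = [credit_risk(income, seniority) for income, seniority in applicants]
```

The hard part, which data mining addresses, is finding the attributes and thresholds in the first place; evaluating the rule is the easy part.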
5.1 Data Mining Applications
• Automatic classification of sky objects.
• Fraud detection.
• Credit scoring.
• Targeted mailing.
• Scouting in NBA (IBM Advanced Scout).
• ...
Targeted Mailing
[Figure: number of responses received plotted against number of messages sent, for targeted mail and for mass mail; the difference between the two curves is the gain.]
5.2 Data Mining Tasks and Techniques
Tasks               Techniques            Algorithms
Prediction          Decision trees        ID3, C4.5, ...
(Classification &   Classification rules  covering algorithm, ...
Regression)         Bayesian networks
                    Neural networks
Association rules                         Apriori and its variants
Clustering          Partitioning          k-Means, k-Medoids, ...
                    Hierarchical          BIRCH, CURE, ...
                    Density-based         DBSCAN, OPTICS, ...
5.3 Classification
• Input = (historical) training data with known class labels. E.g.,
Age     ...  Car     Risk
young   ...  sport   high
middle  ...  sport   high
middle  ...  family  low
...     ...  ...     ...
old     ...  sport   low
• Build a model that predicts the class given values for the other attributes.
• Test the model on a separate data set.
• Use the model to predict new cases.
Classification by Decision Trees
Age     ...  Car     Risk
young   ...  sport   high
middle  ...  sport   high
middle  ...  family  low
...     ...  ...     ...
old     ...  sport   low
→
Age
  young: high
  middle: Car
    sport: high
    family: low
  old: low
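Read as code, the decision tree above is just a nested test on Age and then Car. The sketch below hard-codes that tree (it does not learn it, as ID3 or C4.5 would) and checks it against the four training rows that are visible in the table; the function name is chosen for this illustration.

```python
def predict_risk(age, car):
    """Follow the tree: test Age first, then Car for middle-aged drivers."""
    if age == "young":
        return "high"
    if age == "middle":
        return "high" if car == "sport" else "low"
    return "low"  # age == "old"


# The training rows shown in the table: (age, car, known risk).
training = [
    ("young", "sport", "high"),
    ("middle", "sport", "high"),
    ("middle", "family", "low"),
    ("old", "sport", "low"),
]

correct = sum(predict_risk(age, car) == risk for age, car, risk in training)
accuracy = correct / len(training)
```

Perfect accuracy on the training rows says nothing yet about new cases, which is why the model is also tested on a separate data set.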
More Complex ⇏ Better Prediction
[Figure: prediction error plotted against model complexity, with one curve for training data and one for test data.]
Classification by Neural Networks
User’s view:
[Figure: inputs feeding into a black box.]
Caveat: Experience shows that in many applications [...], the explicit knowledge structures that are acquired, the structural descriptions, are at least as important, and often very much more important, than the ability to perform well on new examples. People frequently use data mining to gain knowledge, not just predictions.
[Witten and Frank, pp. 7–8]
A Look Inside...
[Figure: the black box opened up: the inputs are connected through layers of neurons to the output.]
Neuron
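[Figure: a neuron with inputs $I_1, \ldots, I_n$, weighted by $w_1, \ldots, w_n$, feeding a summation node followed by $f$.]
Output: $f\left(\sum_{j=1}^{n} w_j \times I_j\right)$, where $f$ is a threshold function.
Once the topology (number of layers, number of neurons per layer) of the network is fixed, the weight $w_j$ of each connection is chosen so as to optimize the prediction on training data.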
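A single neuron as described above fits in a few lines. In this sketch the threshold function is a step at a value `theta`; that parameter, and the weights used to make the neuron behave like a logical AND, are assumptions picked by hand for the illustration rather than learned from training data.

```python
def threshold(x, theta=0.0):
    """Step function: fire (1) when x reaches theta, stay silent (0) otherwise."""
    return 1 if x >= theta else 0


def neuron(inputs, weights, theta=0.0):
    """Compute f(sum_j w_j * I_j) with f a threshold function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return threshold(weighted_sum, theta)


# Hand-picked weights: with theta = 1.5 the neuron fires only if both inputs are 1.
and_outputs = [
    neuron((a, b), (1.0, 1.0), theta=1.5) for a in (0, 1) for b in (0, 1)
]
```

Training a network replaces this hand-picking by a search for weights that minimize the prediction error on the training data.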
5.4 Boolean Association Rules
1 {hammer, crowbar, nails}
2 {hammer, saw, screw}
3 {hammer, crowbar, nails, screw}
4 {hammer, crowbar, saw, nails}
5 {screw}
• The association rule
hammer → crowbar, nails
has support 3 (transactions 1, 3, and 4 contain all three items) and confidence 3/4 (four transactions contain hammer). Find all rules that exceed given support and confidence thresholds.
• Very popular research topic.
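The support and confidence of the rule hammer → crowbar, nails can be recomputed directly from the five transactions. This sketch only evaluates one given rule; it is not Apriori, which searches for all rules above the thresholds. The function names are chosen for this illustration.

```python
from fractions import Fraction

transactions = {
    1: {"hammer", "crowbar", "nails"},
    2: {"hammer", "saw", "screw"},
    3: {"hammer", "crowbar", "nails", "screw"},
    4: {"hammer", "crowbar", "saw", "nails"},
    5: {"screw"},
}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return Fraction(support(lhs | rhs), support(lhs))


rule_support = support({"hammer", "crowbar", "nails"})
rule_confidence = confidence({"hammer"}, {"crowbar", "nails"})
```

Support is counted here as an absolute number of transactions, as on the slide; it is also often reported as a fraction of all transactions.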
5.5 Clustering
• Unlike with classification, there are no known class labels.
• Maximize cohesion: distance between clusters >> distances within clusters
[Figure: scatter plot of Income (100K to 300K) against Age (10 to 80), with two circled clusters of points.]
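The cohesion criterion can be checked numerically for a given grouping. In the sketch below the points (age in years, income in thousands) and the two cluster names are hypothetical, and the clusters are given rather than found; an algorithm such as k-Means would have to discover them.

```python
from itertools import combinations
from math import dist

# Hypothetical customers as (age, income in K), already grouped.
clusters = {
    "young, lower income": [(25, 40), (28, 45), (30, 38)],
    "older, higher income": [(55, 210), (58, 220), (60, 205)],
}


def max_within(points):
    """Largest distance between two points of the same cluster."""
    return max(dist(p, q) for p, q in combinations(points, 2))


def min_between(a, b):
    """Smallest distance between a point of cluster a and a point of cluster b."""
    return min(dist(p, q) for p in a for q in b)


first, second = clusters.values()
widest = max(max_within(first), max_within(second))
separation = min_between(first, second)
```

Because age and income are on very different scales, the distances here are dominated by income; normalizing the variables first is a common precaution.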
Clustering (continued)
• Recognizing shapes: distance within cluster may exceed distance between clusters.
[Figure: scatter plot in the x-y plane in which the clusters form elongated shapes.]
6 Next Generation Data Warehousing/Mining
6.1 The Future of Data Warehousing
1. Decision support systems will become more pro-active.
2. Future systems will be specialized into a specific business sector, e.g., petrochemical industry.
3. The data warehouse will be ever more extended with background knowledge from external data sources, such as
• information on the Web,
• information from geographical information systems.
Web Warehousing...
E.g., analyzing information on Web pages of competitors: pricing policies, promotional events, price variations, frequency of promotions, ...
Difficulties:
• No historical information. Data may be out of date. How did the price of product X change at a competitor’s site?
• Web sites are autonomous; they can change content and structure at any time.
• Which Web sites are credible/authoritative?
• Poor productivity when searching. Which online store sells product X at the lowest price?
6.2 The Future of Data Mining
• Specialization into specific business sectors. E.g., pharmaceutical industry.
• Increased expressiveness. E.g., inductive logic programming (ILP).
• Paradigms for improved user interaction. E.g., inductive databases.
• “Hybrid” data mining techniques (collaboration and competition).
A Note on Expressiveness
ID Width Height Sides Class
a 2 4 4 standing
b 3 6 4 standing
c 8 10 3 standing
d 2 9 4 standing
e 9 1 4 lying
f 4 3 4 lying
g 7 6 3 lying
h 10 2 3 lying
[Figure: the eight shapes a to h drawn as three- and four-sided figures, grouped into Standing (a, b, c, d) and Lying (e, f, g, h).]
Classical classification systems compare attributes with constant values:
If width > 3.5 and height < 8 then lying
The following rule compares attributes with each other:
If width > height then lying
[Figure: two plots of the shapes a to h with axes width and height, one for each rule.]
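Both rules can be checked against the eight shapes in the table. The sketch below encodes the table and the two rules; the function names are chosen for this illustration, and treating every shape that is not classified as lying as standing is an assumption, since the slides state only the "lying" half of each rule.

```python
# The table above: ID -> (width, height, class).
shapes = {
    "a": (2, 4, "standing"),
    "b": (3, 6, "standing"),
    "c": (8, 10, "standing"),
    "d": (2, 9, "standing"),
    "e": (9, 1, "lying"),
    "f": (4, 3, "lying"),
    "g": (7, 6, "lying"),
    "h": (10, 2, "lying"),
}


def constant_rule(width, height):
    """Classical rule: compare each attribute with a constant."""
    return "lying" if width > 3.5 and height < 8 else "standing"


def relational_rule(width, height):
    """More expressive rule: compare the attributes with each other."""
    return "lying" if width > height else "standing"


def errors(rule):
    """IDs of the shapes that the rule misclassifies."""
    return [k for k, (w, h, cls) in shapes.items() if rule(w, h) != cls]


constant_errors = errors(constant_rule)
relational_errors = errors(relational_rule)
```

Both rules fit these eight examples, but the first depends on two constants tuned to this particular sample, whereas the second states the underlying concept directly.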
7 Important Players
All market analysts agree on a large growth of the data warehousing and data mining software market in the coming years.
OLAP
The OLAP market is strongly fragmented, without dominant market leaders.
• All important database vendors (IBM, Informix Software, Microsoft, Oracle, Sybase) provide solutions for OLAP and data warehousing.
• Other important players include Hyperion, Cognos,MicroStrategy, Business Objects.
Source: http://www.olapreport.com/.
Data Mining
Important data mining products include
• Clementine (SPSS),
• Enterprise Miner (SAS),
• Intelligent Miner (IBM).
See http://www.kdnuggets.com/.
8 Selected Literature
• Certain documents on the Web are more interesting than many textbooks.
• See http://www.cs.utoronto.ca/~mendel/ for:
– an overview of scientific research in data warehousing and OLAP;
– links to white papers written by commercial software vendors.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
• I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.