9 October 2003 1
CSCI6405 Fall 2003: Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Assistant: Christopher Jordan, Email: [email protected]
Hours: TR, 1:30 - 3:00 PM
Lectures Outline
Part I: Overview of DM and DW
1. Introduction (Ch1) Ass1 Due: Sep 23 Tue
2. Data preprocessing (Ch3)
Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2) Ass2: Sep 23 - Oct 14
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (Ch4)
5. Classification data mining (Ch7) Ass3: Oct 7 - Oct 21
6. Association data mining (Ch6) Ass4: Oct 21 - Nov 5
7. Characterization data mining (Ch5)
8. Clustering data mining (Ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations
Project Due: Dec 8
3. DATA PREPROCESSING (Ch3)
- Data Preprocessing (DPP) Concept
- Major Tasks of DPP
- A DPP Case Study
- Summary
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Why Data Preprocessing?
- Raw data have errors and inconsistencies (Data cleaning)
- Data need to be integrated from different sources and a unique format is needed (Data integration and transformation)
- Irrelevant data should be removed (Data reduction)
- Domain knowledge should be added into the prepared data (Discretization and concept hierarchy generation)
Major Tasks of DPP (cont)
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction but with particular importance, especially for numerical data
Why data cleaning?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=""
- noisy: containing errors or outliers
  e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names
  e.g., Age="42" with Birthday="03/07/1997"
  e.g., was rating "1, 2, 3", now rating "A, B, C"
Why is data dirty?
Incomplete data comes from
- data values that were not available when the data was collected
- different considerations between the time the data was collected and the time it is analyzed
- human/hardware/software problems
Noisy data comes from the process of data
- collection
- entry
- transmission
Inconsistent data comes from
- different data sources
- functional dependency violations
E.g., data normalization for clustering mining
For clustering mining of a customer database: DB (Age, Income, Credit)
The distance between two data points C1 and C2, with attributes a1 (Age), a2 (Income), and a3 (Credit):
d = ((C1_a1 - C2_a1)^2 + (C1_a2 - C2_a2)^2 + (C1_a3 - C2_a3)^2)^(1/2)

               Age  Income  Credit
Customer1:      32  40,000  10,000
Customer2:      24  30,000   2,000
Difference:      8  10,000   8,000
Normalized by:   1  1/1000  1/1000
Rescaled:        8      10       8

If we scale all the attributes to the same order of magnitude, we obtain a reliable distance measure between the different records.
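The effect of rescaling on the distance can be checked with a short Python sketch. It uses the two customer records from the example and divides income and credit by 1000; the function and variable names are illustrative assumptions, not part of the course material.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescale(record, divisors):
    """Divide each attribute by its divisor so all attributes share a similar magnitude."""
    return tuple(value / divisor for value, divisor in zip(record, divisors))

# Records as (age, income, credit), taken from the example.
customer1 = (32, 40_000, 10_000)
customer2 = (24, 30_000, 2_000)

# Age is left unchanged; income and credit are divided by 1000.
divisors = (1, 1000, 1000)

raw_distance = euclidean(customer1, customer2)
scaled_distance = euclidean(rescale(customer1, divisors), rescale(customer2, divisors))

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

Without rescaling, the age difference of 8 contributes almost nothing next to the income and credit differences; after rescaling, the three differences (8, 10, 8) contribute on a comparable scale.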
A DPP Case Study
Data mining task:
- Mining clusters of clients for a magazine publisher database.
- …
Data preparation for clustering: cleaned, integrated, normalized, numerical valued data, etc.
Business background: The publisher sells five types of magazine: on cars, houses, sports, music, and comics. The aim of the data mining is to find new, interesting clusters of clients in order to set up a marketing exercise. The business is interested in questions such as "What is the typical profile of a reader of a car magazine?", "Is there any correlation between an interest in cars and an interest in comics?" ...
1. Data Selection
The database should contain the subscription records for the magazines.
• It should be a selection of operational data from the publisher's invoicing system, containing information about people who have subscribed to a magazine.
• The records consist of: client number, name, address, date of subscription, and type of magazine.
• To facilitate the DM process, a copy of this operational data is drawn and stored in a separate database (refer to Table 1).
Table 1. Original data

Client number  Name     Address           Date of purchase  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94          car
23003          Johnson  1 Downing Street  06-21-93          music
23003          Johnson  1 Downing Street  05-30-92          comic
23009          Clinton  2 Boulevard       01-01-01          comic
23013          King     3 High Road       02-30-95          sports
23019          Jonson   1 Downing Street  01-01-01          house
2. Data Cleaning: remove duplications
Duplication of records:
In an operational client database, some clients may be represented by several records. Possible causes include:
- negligence, such as people making typing errors
- clients moving from one place to another without notifying the company of the change of address
- cases in which people deliberately misspell their names or give incorrect information about themselves to avoid a negative decision ...
(Refer to Table 2)
Table 2. De-duplication

Client number  Name     Address           Date of purchase  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94          car
23003          Johnson  1 Downing Street  06-21-93          music
23003          Johnson  1 Downing Street  05-30-92          comic
23009          Clinton  2 Boulevard       01-01-01          comic
23013          King     3 High Road       02-30-95          sports
23003          Johnson  1 Downing Street  01-01-01          house
De-duplication
E.g., the records for Mr. Johnson and Mr. Jonson in the database have different client numbers but the same address, which is a strong indication that they are the same person.
This type of pollution gives a company the impression that it has more clients than is in fact the case.
Of course, we can never be sure of this, but a de-duplication algorithm using pattern analysis techniques could identify the situation and present it to a user to make a decision.
De-duplication: the duplicated records may be identified by a pattern recognition algorithm and then corrected.
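One simple way to flag such candidates is sketched below in Python. It is not the algorithm used in the course: it assumes that two records are worth showing to a user when they share an address, carry different client numbers, and have names whose string similarity (from the standard library's `difflib.SequenceMatcher`) is at least a threshold. The threshold of 0.8 and all names in the sketch are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# One (client number, name, address) tuple per distinct client in the original data.
clients = [
    (23003, "Johnson", "1 Downing Street"),
    (23009, "Clinton", "2 Boulevard"),
    (23013, "King", "3 High Road"),
    (23019, "Jonson", "1 Downing Street"),
]

def candidate_duplicates(records, threshold=0.8):
    """Return pairs of client numbers that may refer to the same person.

    A pair is flagged when the addresses match exactly and the names are
    similar enough; the final decision is left to a human reviewer.
    """
    candidates = []
    for (id_a, name_a, addr_a), (id_b, name_b, addr_b) in combinations(records, 2):
        if id_a == id_b or addr_a != addr_b:
            continue
        similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if similarity >= threshold:
            candidates.append((id_a, id_b, round(similarity, 2)))
    return candidates

flagged = candidate_duplicates(clients)
for id_a, id_b, similarity in flagged:
    print(f"possible duplicate: {id_a} and {id_b} (name similarity {similarity})")
```

The sketch only proposes candidates; merging the records (here, assigning client number 23003 to the Jonson record) remains a decision for the user.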
2. Data Cleaning: correct domain inconsistency
Domain inconsistency: pollution caused by domain values that are not consistent with their definitions.
E.g., in the example table, the date 01-01-01 means 1 January 1901 (the company did not even exist at that time).
In some databases, analysis shows an unexpectedly high number of people born on 11 November:
When people are forced to fill in a birth date on a screen and either do not know it or do not want to divulge it, they are inclined to type in `11-11-11'.
Untrue values of this kind can be disastrous in a data mining context.
If information is unknown (NULL), it should be represented as such in the database.
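A minimal Python sketch of this rule follows. It assumes the set of placeholder values is already known (here the two mentioned above, `01-01-01` and `11-11-11`) and represents NULL as Python's `None`; the names `PLACEHOLDER_DATES` and `clean_date` are hypothetical.

```python
# Date strings that are known to be entered as placeholders rather than real values.
PLACEHOLDER_DATES = {"01-01-01", "11-11-11"}

def clean_date(value):
    """Return None (NULL) for a known placeholder date, otherwise the value unchanged."""
    return None if value in PLACEHOLDER_DATES else value

# Purchase dates from the example table, in row order.
purchase_dates = ["04-15-94", "06-21-93", "05-30-92", "01-01-01", "02-30-95", "01-01-01"]
cleaned = [clean_date(d) for d in purchase_dates]

print(cleaned)
```

This only marks the value as unknown. Recovering the true value, where that is possible at all, requires another data source.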
Table 3. Domain consistency

Client number  Name     Address           Date of purchase  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94          car
23003          Johnson  1 Downing Street  06-21-93          music
23003          Johnson  1 Downing Street  05-30-92          comic
23009          Clinton  2 Boulevard       NULL              comic
23013          King     3 High Road       02-30-95          sports
23003          Johnson  1 Downing Street  12-20-94          house
3. Data Integration (Enrichment)
Suppose that we have purchased extra information about our clients consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house. (Refer to Table 4)
* You therefore have to make a deliberate decision either to overlook it or to delete it. A general rule states that any deletion of data must be a conscious decision, after a thorough analysis of the possible consequences.
Table 4. Additional data available for enrichment

Client name  Date of birth  Income   Credit   Car owner  House owner
Johnson      04-13-76       $18,500  $17,800  no         no
Clinton      10-20-71       $36,000  $26,600  yes        no
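Enrichment of this kind amounts to a left join of the client records with the purchased data. The Python sketch below assumes the client name is the join key and fills the extra attributes with `None` (NULL) when no purchased data exists for a client; the field names are illustrative, not taken from the course material.

```python
# A few client records (one per client is enough to show the join).
subscriptions = [
    {"client_number": 23003, "name": "Johnson", "magazine": "car"},
    {"client_number": 23009, "name": "Clinton", "magazine": "comic"},
    {"client_number": 23013, "name": "King", "magazine": "sports"},
]

# Purchased data, keyed by client name.
purchased = {
    "Johnson": {"date_of_birth": "04-13-76", "income": 18500, "credit": 17800,
                "car_owner": "no", "house_owner": "no"},
    "Clinton": {"date_of_birth": "10-20-71", "income": 36000, "credit": 26600,
                "car_owner": "yes", "house_owner": "no"},
}

EXTRA_FIELDS = ["date_of_birth", "income", "credit", "car_owner", "house_owner"]

def enrich(rows, extra):
    """Left join: keep every row, adding the extra fields or None when unknown."""
    unknown = dict.fromkeys(EXTRA_FIELDS)  # every extra field set to None
    return [{**row, **extra.get(row["name"], unknown)} for row in rows]

enriched = enrich(subscriptions, purchased)
for row in enriched:
    print(row)
```

No extra data was purchased for King, so his enriched record carries NULLs in every added attribute.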
Table 5. Enriched table

Client number  Name     Date of birth  Income   Credit   Car owner  House owner  Address           Date purchase made  Magazine purchased
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  04-15-94            car
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  06-21-93            music
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  05-30-92            comic
23009          Clinton  10-20-71       $36,000  $26,600  yes        no           2 Boulevard       NULL                comic
23013          King     NULL           NULL     NULL     NULL       NULL         NULL              02-30-95            sports
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  12-20-94            house
4. Data Reduction
Remove the columns and rows which are not valuable to the DM process.
In Table 6, the column NAME and the row with multiple NULL values are removed from the database.
In a real DM project, many of the tables are collected from operational data, a lot of desirable data is missing, and most of it is impossible to retrieve.
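The removal step can be sketched in Python as follows. The sketch assumes a row is dropped when it has more than one NULL (`None`) value, which is one reading of "multiple NULL values"; the function name, parameters, and the reduced set of fields are illustrative.

```python
def reduce_table(rows, drop_columns=("name",), max_nulls=1):
    """Drop rows with more than max_nulls NULL values, then drop the given columns."""
    reduced = []
    for row in rows:
        null_count = sum(value is None for value in row.values())
        if null_count > max_nulls:
            continue  # too little information left in this row
        reduced.append({key: value for key, value in row.items()
                        if key not in drop_columns})
    return reduced

# Three rows of the enriched table, with a subset of its columns.
enriched_rows = [
    {"client_number": 23003, "name": "Johnson", "income": 18500,
     "address": "1 Downing Street", "purchase_date": "04-15-94"},
    {"client_number": 23009, "name": "Clinton", "income": 36000,
     "address": "2 Boulevard", "purchase_date": None},
    {"client_number": 23013, "name": "King", "income": None,
     "address": None, "purchase_date": "02-30-95"},
]

reduced_rows = reduce_table(enriched_rows)
for row in reduced_rows:
    print(row)
```

The Clinton row has a single NULL and is kept, while the King row has several and is removed. Any such threshold should be chosen deliberately, since a deleted row cannot be analyzed later.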
Table 6. Table with column and row removed

Client number  Date of birth  Income   Credit   Car owner  House owner  Address           Date purchase made  Magazine purchased
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  04-15-94            car
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  06-21-93            music
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  05-30-92            comic
23009          10-20-71       $36,000  $26,600  yes        no           2 Boulevard       NULL                comic
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  12-20-94            house
4. Data Reduction (cont)
In some cases, especially fraud detection, a lack of information can be a valuable indication of interesting patterns. Up to this point, the process has consisted mainly of simple SQL operations.
5. Data transformation
For most databases, the information provided is much too detailed to be used as input to data mining algorithms.
Apply the following coding steps:
1. Address to region
2. Birth date to age
3. Divide income by 1000
4. Divide credit by 1000
5. Convert car yes/no to 1/0
6. Convert purchase date to month numbers starting from 1990
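The coding steps can be sketched in Python as below. Several details are assumptions made for illustration only: the reference date used to compute ages (31 December 1996), the address-to-region lookup table, the reading of two-digit years as 19YY, and all function and field names. The house-owner flag is coded the same way as the car-owner flag.

```python
from datetime import date

REFERENCE_DATE = date(1996, 12, 31)  # assumed "today" for computing ages
REGION_CODES = {"1 Downing Street": 1, "2 Boulevard": 1}  # hypothetical lookup table

def parse_mdy(text):
    """Parse an MM-DD-YY string, reading YY as 19YY."""
    month, day, year = (int(part) for part in text.split("-"))
    return date(1900 + year, month, day)

def age_on(birth, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

def month_number(purchase, base_year=1990):
    """Months counted from January of base_year (January 1990 is month 1)."""
    return (purchase.year - base_year) * 12 + purchase.month

def encode(row):
    purchase = row["purchase_date"]
    return {
        "client_number": row["client_number"],
        "age": age_on(parse_mdy(row["date_of_birth"]), REFERENCE_DATE),
        "income": row["income"] / 1000,
        "credit": row["credit"] / 1000,
        "car_owner": 1 if row["car_owner"] == "yes" else 0,
        "house_owner": 1 if row["house_owner"] == "yes" else 0,
        "region": REGION_CODES.get(row["address"]),
        "month_of_purchase": month_number(parse_mdy(purchase)) if purchase else None,
        "magazine": row["magazine"],
    }

row = {"client_number": 23003, "date_of_birth": "04-13-76", "income": 18500,
       "credit": 17800, "car_owner": "no", "house_owner": "no",
       "address": "1 Downing Street", "purchase_date": "04-15-94", "magazine": "car"}
encoded = encode(row)
print(encoded)
```

A NULL purchase date stays NULL after coding rather than being turned into a month number.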
Table 7. An intermediate coding stage

Client number  Age  Income  Credit  Car owner  House owner  Region  Month of purchase  Magazine purchased
23003          20   18.5    17.8    0          0            1       52                 car
23003          20   18.5    17.8    0          0            1       42                 music
23003          20   18.5    17.8    0          0            1       29                 comic
23009          25   36.0    26.6    1          0            1       NULL               comic
23003          20   18.5    17.8    0          0            1       48                 house
Table 8. The final table

Client number  Age  Income  Credit  Car owner  House owner  Region  Car magazine  House  Sports  Music  Comic
23003          20   18.5    17.8    0          0            1       1             1      0       1      1
23009          25   36.0    26.6    1          0            1       0             0      0       0      1
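The final table has one row per client, with one 0/1 column per magazine type in place of the magazine and month-of-purchase columns. A Python sketch of this flattening step is shown below; the function name, the field names, and the reduced set of client attributes are assumptions for illustration.

```python
MAGAZINES = ["car", "house", "sports", "music", "comic"]

def flatten(rows):
    """Collapse one-row-per-purchase data into one row per client.

    Each magazine type becomes a 0/1 column indicating whether the client
    bought it; the per-purchase columns are dropped.
    """
    clients = {}
    for row in rows:
        key = row["client_number"]
        if key not in clients:
            profile = {name: value for name, value in row.items()
                       if name not in ("magazine", "month_of_purchase")}
            profile.update({magazine: 0 for magazine in MAGAZINES})
            clients[key] = profile
        clients[key][row["magazine"]] = 1
    return list(clients.values())

# Rows from the intermediate coding stage, with a subset of its columns.
coded_rows = [
    {"client_number": 23003, "age": 20, "month_of_purchase": 52, "magazine": "car"},
    {"client_number": 23003, "age": 20, "month_of_purchase": 42, "magazine": "music"},
    {"client_number": 23003, "age": 20, "month_of_purchase": 29, "magazine": "comic"},
    {"client_number": 23009, "age": 25, "month_of_purchase": None, "magazine": "comic"},
    {"client_number": 23003, "age": 20, "month_of_purchase": 48, "magazine": "house"},
]

final_rows = flatten(coded_rows)
for row in final_rows:
    print(row)
```

With every attribute numeric and one record per client, the table is in a form that distance-based clustering can work on directly.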
Summary
Data preparation is a major issue and the most time-consuming process in both mining and warehousing
Data preparation includes
Data cleaning, integration, transformation, reduction, discretization, etc.
Many DPP tools have been developed but it is still an active research area because of the effort needed for