Data Warehousing
Naveed Iqbal, Assistant Professor, NUCES, Islamabad
(Lecture Slides Week # 13)
Data Duplication Elimination & BSN Method
Data Duplication
Why is data duplicated?
• A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schema / representation) of the same entity.
• Data coming from outside the organization owning the DWH can be of even lower quality, i.e. different representations for the same entity, transcription or typographical errors.
Problems due to data duplication
Data duplication can result in costly errors, such as:
• False frequency distributions
• Incorrect aggregates due to double counting
• Difficulty with catching fabricated identities by credit card companies
Data Duplication: Non-Unique PK
• Multiple Customer Numbers
Name                Phone Number    Cust. No.
M. Ismail Siddiqi   021.666.1244    780701
M. Ismail Siddiqi   021.666.1244    780203
M. Ismail Siddiqi   021.666.1244    780009
• Multiple Employee Numbers
Bonus Date   Name            Department   Emp. No.
Jan. 2000    Khan Muhammad   213 (MKT)    5353536
Dec. 2001    Khan Muhammad   567 (SLS)    4577833
Mar. 2002    Khan Muhammad   349 (HR)     3457642
Unable to determine customer relationships (CRM)
Unable to analyze employee benefits trends
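One way to surface the non-unique PK problem is to group records on the attributes that should identify a person and list every group that carries more than one number. The sketch below uses the customer rows from this slide; the function name `find_multiple_numbers` and the choice of (name, phone) as the identifying pair are assumptions made for illustration.

```python
from collections import defaultdict

# Customer rows from the slide: (name, phone, customer number).
customers = [
    ("M. Ismail Siddiqi", "021.666.1244", "780701"),
    ("M. Ismail Siddiqi", "021.666.1244", "780203"),
    ("M. Ismail Siddiqi", "021.666.1244", "780009"),
]

def find_multiple_numbers(rows):
    """Return {(name, phone): [customer numbers]} for persons with more than one number."""
    numbers_by_person = defaultdict(list)
    for name, phone, cust_no in rows:
        numbers_by_person[(name, phone)].append(cust_no)
    return {person: nums for person, nums in numbers_by_person.items() if len(nums) > 1}

duplicates = find_multiple_numbers(customers)
```

Exact grouping like this only works when name and phone are spelled identically in every record; dirty values need the similarity-based matching covered later in the lecture.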
Data Duplication: House Holding
Group together all records that belong to the same household.
……… S. Ahad 440, Munir Road, Lahore
……… ………….… ………………………………
……… Shiekh Ahad No. 440, Munir Rd, Lhr
……… Shiekh Ahed House # 440, Munir Road, Lahore
……… ………….… ………………………………
Why bother?
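A minimal sketch of house holding, assuming that addresses can be brought to a common form with a small lookup table. The abbreviation table, the noise-word list, and the function names are hypothetical; a real system would rely on a much larger external reference.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables, just large enough for the slide's three addresses.
ABBREVIATIONS = {"rd": "road", "lhr": "lahore"}
NOISE_WORDS = {"no", "house"}

def normalize_address(address):
    """Lower-case, drop punctuation and noise words, expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens if t not in NOISE_WORDS)

def group_households(records):
    """Group names by normalized address; each group is one candidate household."""
    households = defaultdict(list)
    for name, address in records:
        households[normalize_address(address)].append(name)
    return dict(households)

records = [
    ("S. Ahad", "440, Munir Road, Lahore"),
    ("Shiekh Ahad", "No. 440, Munir Rd, Lhr"),
    ("Shiekh Ahed", "House # 440, Munir Road, Lahore"),
]
households = group_households(records)
```

All three records end up under the same normalized address, so they form one candidate household even though the raw address strings differ.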
Data Duplication: Individualization
Identify multiple records in each household which represent the same individual.
……… M. Ahad 440, Munir Road, Lahore
……… ………….… ………………………………
……… Maj Ahad 440, Munir Road, Lahore
Address field is standardized. By coincidence?
Overview of the Basic Concept
In its simplest form, there is an identifying attribute (or combination) per record for identification.
Records can be from a single source or from multiple sources sharing the same PK or other common unique attributes.
Sorting is performed on identifying attributes and neighboring records are checked.
What if no common attributes or dirty data?
The degree of similarity is measured numerically; different attributes may contribute differently.
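The idea of a numeric similarity in which attributes contribute differently can be sketched as a weighted average of per-field similarities. The weights below are made-up values, and `difflib.SequenceMatcher` is only one possible string similarity; the lecture does not prescribe a particular measure.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the name is assumed to matter more than the address.
WEIGHTS = {"name": 0.6, "address": 0.4}

def field_similarity(a, b):
    """Similarity of two strings in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total

a = {"name": "Shiekh Ahad", "address": "440 Munir Road Lahore"}
b = {"name": "Shiekh Ahed", "address": "440 Munir Road Lahore"}
c = {"name": "Saiam Noor", "address": "Flat 5 Afshan Colony Saidpur Road Lahore"}

score_ab = record_similarity(a, b)  # near-identical records, high score
score_ac = record_similarity(a, c)  # unrelated records, lower score
```

A threshold on this score then decides which pairs are treated as matching candidates; choosing the weights and the threshold is a tuning decision that depends on the data.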
Basic Sorted Neighborhood (BSN) Method
Concatenate data into one sequential list of N records.
Step 1: Create Keys
• Compute a key for each record in the list by extracting relevant fields or portions of fields.
• Effectiveness of this method highly depends on a properly chosen key.
Step 2: Sort Data
• Sort the records in the data list using the key of step 1.
Step 3: Merge
• Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
• If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
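The three steps can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the key (first three letters of the name), the match test (a `difflib` ratio of at least 0.9), and the sample names are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher

def sorted_neighborhood(records, make_key, is_match, w):
    """Return matching pairs found by sliding a window of w records over the sorted list."""
    ordered = sorted(records, key=make_key)          # Steps 1 and 2: create keys, sort
    pairs = []
    for i, current in enumerate(ordered):            # Step 3: merge
        # Each record entering the window is compared with the previous w-1 records.
        for previous in ordered[max(0, i - (w - 1)):i]:
            if is_match(previous, current):
                pairs.append((previous, current))
    return pairs

def name_key(name):
    """Hypothetical key: first three letters of the name, upper-cased."""
    return name[:3].upper()

def similar(a, b):
    """Hypothetical match test based on string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.9

names = ["Noman Abdullah", "Saiam Noor", "Noma Abdullah", "Khan Muhammad"]
pairs = sorted_neighborhood(names, name_key, similar, w=2)
```

With N records and window size w, the merge step performs at most N(w-1) comparisons instead of the N(N-1)/2 needed to compare every pair, which is the reason for sorting first.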
BSN Method: Sliding Window
[Figure: a window sliding over the sorted list of records, showing the current window of w records and the next window of w records]
BSN Method: Selection of Keys
Selection of Keys
• Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.
• A key is a sequence of a subset of attributes or sub-strings within the attributes chosen from the record.
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.
First       Middle   Address            NID        Key
Muhammed    Ahmad    440 Munir Road     34535322   AHM440MUN345
Muhammad    Ahmad    440 Munir Road     34535322   AHM440MUN345
Muhammed    Ahmed    440 Munir Road     34535322   AHM440MUN345
Muhammad    Ahmar    440 Munawar Road   34535334   AHM440MUN345
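The keys in the table appear to be built from the first three letters of the middle name, the house number, the first three letters of the street name, and the first three digits of the NID. The sketch below reproduces that recipe; the recipe is inferred from the table rather than stated on the slide, and `make_key` is a hypothetical name.

```python
def make_key(middle, address, nid):
    """Build a sort key such as AHM440MUN345 from portions of three fields."""
    house_no, street = address.split()[:2]   # e.g. "440", "Munir"
    return (middle[:3] + house_no + street[:3] + nid[:3]).upper()

rows = [
    ("Muhammed", "Ahmad", "440 Munir Road", "34535322"),
    ("Muhammad", "Ahmad", "440 Munir Road", "34535322"),
    ("Muhammed", "Ahmed", "440 Munir Road", "34535322"),
    ("Muhammad", "Ahmar", "440 Munawar Road", "34535334"),
]
keys = [make_key(middle, address, nid) for _first, middle, address, nid in rows]
```

All four rows receive the same key, so they sort next to each other. That includes the last row, which has a different street and NID, so the window comparison still has to decide whether it is a true match.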
BSN Method: Problem with keys
Since data is dirty, keys WILL also be dirty, and matching records will not come together.
Data becomes dirty due to data entry errors or use of abbreviations. Some real examples are as follows:
• Technology
• Tech.
• Techno.
• Tchnlgy
Solution is to use external standard source files to validate the data and resolve any data conflicts.
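A minimal sketch of validating values against a standard source before keys are built. Here the "external standard source file" is stood in for by a small in-memory dictionary, which is an assumption; in practice it would be loaded from a maintained reference file.

```python
# Hypothetical stand-in for an external standard source file.
STANDARD_TERMS = {
    "technology": "Technology",
    "tech": "Technology",
    "techno": "Technology",
    "tchnlgy": "Technology",
}

def standardize(value, reference=STANDARD_TERMS):
    """Map a dirty value to its standard form; unknown values are returned unchanged."""
    cleaned = value.strip().rstrip(".").lower()
    return reference.get(cleaned, value)

variants = ["Technology", "Tech.", "Techno.", "Tchnlgy"]
standardized = [standardize(v) for v in variants]
```

Values that are missing from the reference pass through untouched, so they can be logged and reviewed instead of being silently changed.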
BSN Method: Problem with keys
If contents of fields are not properly ordered, similar records will NOT fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
No   Name              Address                                        Gender
1    N. Jaffri, Syed   No. 420, Street 15, Chaklala 4, Rawalpindi     M
2    S. Noman          420, Scheme 4, Rwp                             M
3    Saiam Noor        Flat 5, Afshan Colony, Saidpur Road, Lahore    F
Solution is to TOKENize the fields, i.e. break them further. Use the tokens in different fields for sorting to fix the error.
Example: Using either the name or the address field, records 1 and 2 will fall close.
No   Name            Address                                    Gender
1    Syed N Jaffri   420 15 4 Chaklala No Rawalpindi Street     M
2    Syed Noman      420 4 Rwp Scheme                           M
3    Saiam Noor      5 Afshan Colony Flat Lahore Road Saidpur   F
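The tokenized addresses in the second table are consistent with a simple rule: split the field into tokens, put the numeric tokens first in their original order, then the word tokens in alphabetical order. The sketch below implements that rule for the address field only; the rule is inferred from the table, and the name column is left out because expanding "S." to "Syed" needs outside knowledge.

```python
import re

def tokenize_address(address):
    """Numbers first (original order), then words sorted alphabetically."""
    tokens = re.findall(r"[A-Za-z]+|\d+", address)
    numbers = [t for t in tokens if t.isdigit()]
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)

addresses = [
    "No. 420, Street 15, Chaklala 4, Rawalpindi",
    "420, Scheme 4, Rwp",
    "Flat 5, Afshan Colony, Saidpur Road, Lahore",
]
tokenized = [tokenize_address(a) for a in addresses]
```

After tokenizing, records 1 and 2 both start with 420, so a key built from the leading tokens places them next to each other.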
BSN Method: Matching Candidates
Merging of records is a complex inferential process.
Example-1: Two persons with names spelled nearly but not identically have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.
Example-2: Two persons have the same National ID number, but names and addresses are completely different. We infer either that it is the same person, who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them.
Use of further information such as age, gender etc. can alter the decision.
Example-3: Noma-F and Noman-M: we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. Noma-30 and Noman-5: mother and son.
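The three examples can be turned into a toy rule set to show how extra attributes change the decision. The rules, the 0.85 threshold, the field names, and the NID and address values in the sample records are illustrative assumptions; real matching rules are far richer and are normally tuned on actual data.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(r1, r2, name_threshold=0.85):
    """Toy decision rules for a candidate pair of records (dicts)."""
    similar_name = name_similarity(r1["name"], r2["name"]) >= name_threshold
    same_address = r1["address"] == r2["address"]
    same_nid = r1.get("nid") is not None and r1.get("nid") == r2.get("nid")
    g1, g2 = r1.get("gender"), r2.get("gender")
    gender_conflict = g1 is not None and g2 is not None and g1 != g2

    if similar_name and same_address:
        # Example-3: further information (gender) overturns the "same person" inference.
        if gender_conflict:
            return "same household, different persons"
        return "same person"                    # Example-1
    if same_nid:
        return "needs review"                   # Example-2: moved, or NID wrong
    return "different persons"

ex1 = classify_pair({"name": "Noma Abdullah", "address": "440 Munir Road"},
                    {"name": "Noman Abdullah", "address": "440 Munir Road"})
ex2 = classify_pair({"name": "Noman Abdullah", "address": "440 Munir Road", "nid": "34535322"},
                    {"name": "Khan Muhammad", "address": "5 Afshan Colony", "nid": "34535322"})
ex3 = classify_pair({"name": "Noma Abdullah", "address": "440 Munir Road", "gender": "F"},
                    {"name": "Noman Abdullah", "address": "440 Munir Road", "gender": "M"})
```

The same pair of names yields different outcomes in Example-1 and Example-3 purely because gender is available in the latter.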
Introduction to Data Quality Management (DQM)
What is Quality?
Informally
• Some things are better than others, i.e. they are of higher quality. How much “better” is better?
• Is the right item the best item to purchase? How about after the purchase?
• What is quality of service? The bank example
Formally
• “Quality is conformance to requirements” / “Degree of excellence”
Example:
• Quality means meeting customer’s needs, not necessarily exceeding them.
• Quality means improving things customers care about, because that makes their lives easier and more comfortable.
What is Data Quality?
What is Data?
Height = 5’8”
Weight = 160 lbs
Muhammad Khan
Gender = Male
Age = 35 yrs
Emp_ID = 440
All data is an abstraction of something real.
Intrinsic Data Quality
• Electronic reproduction of reality.
Realistic Data Quality
• Degree of utility or value of data to business.
Data Quality & Organizations
Intelligent Learning Organization:
High-quality data is an open, shared resource with value-adding processes.
The Dysfunctional Learning Organization:
Low-quality data is a proprietary resource with cost-adding processes.
Orr’s Laws of Data Quality
Law #1 - “Data that is not used cannot be correct!”
Law #2 - “Data quality is a function of its use, not its collection!”
Law #3 - “Data will be no better than its most stringent use!”
Law #4 - “Data quality problems increase with the age of the system!”
Law #5 - “The less likely something is to occur, the more traumatic it will be when it happens!”
Total Quality Control / Management (TQM)
Philosophy of involving all concepts for systematic and continuous improvement.
It is customer oriented. Why?
TQM incorporates the concept of product quality, process control, quality assurance, and quality improvement.
Quality assurance is NOT Quality improvement.
Cost of Fixing Data Quality
[Figure: cost of achieving quality (vertical axis) plotted against quality, from lowest to highest (horizontal axis); the curve shows an exponential rise in cost]
Defect minimization is economical.
Defect elimination is very very expensive.
Cost of Data Quality Defects
Controllable Costs
• Recurring costs for analyzing, correcting, and preventing data errors
Resultant Costs
• Internal and external failure costs of business / opportunities missed
Equipment & Training Costs
Characteristics or Dimensions of Data Quality
Data Quality Characteristic   Definition
Accuracy           Qualitatively assessing lack of error, high accuracy corresponding to small error.
Completeness       The degree to which values are present in the attributes that require them.
Consistency        A measure of the degree to which a set of data satisfies a set of constraints.
Timeliness         A measure of how current or up-to-date the data is.
Uniqueness         The state of being only one of its kind or being without an equal or parallel.
Interpretability   The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
Accessibility      The extent to which data is available, or easily and quickly retrievable.
Objectivity        The extent to which data is unbiased, unprejudiced, and impartial.
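Two of these dimensions are easy to turn into numbers. The sketch below computes completeness as the share of required attribute values that are present, and uniqueness as the share of distinct identifying keys. These formulas are one common reading of the definitions above, not the only one, and the fourth sample record (with a missing phone number) is invented for illustration.

```python
def completeness(records, required):
    """Fraction of required attribute values that are present (not None or empty)."""
    total = len(records) * len(required)
    present = sum(1 for r in records for f in required if r.get(f) not in (None, ""))
    return present / total if total else 1.0

def uniqueness(records, key_fields):
    """Fraction of records that are distinct on the identifying attributes."""
    keys = [tuple(r.get(f) for f in key_fields) for r in records]
    return len(set(keys)) / len(keys) if keys else 1.0

sample = [
    {"name": "M. Ismail Siddiqi", "phone": "021.666.1244", "cust_no": "780701"},
    {"name": "M. Ismail Siddiqi", "phone": "021.666.1244", "cust_no": "780203"},
    {"name": "M. Ismail Siddiqi", "phone": "021.666.1244", "cust_no": "780009"},
    {"name": "Khan Muhammad", "phone": None, "cust_no": "780500"},  # invented record
]
completeness_score = completeness(sample, ["name", "phone"])   # 7 of 8 values present
uniqueness_score = uniqueness(sample, ["name", "phone"])       # 2 distinct keys in 4 records
```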
Completeness vs. Accuracy
95% accurate and 100% complete
OR
100% accurate and 95% complete
Which is better?
Depends on (i) the data quality tolerances, (ii) the corresponding application, and (iii) the cost of achieving that data quality vs. (iv) the business value.
Data Quality Management Process
Establish TDQM Environment
Scope Data Quality Projects & Develop Implementation Plans
Evaluate Data Quality Management Methods
Implement Data Quality Projects (Define, Measure, Analyze, Improve)
Data Quality Management Process
Establish Data Quality Management Environment
• Information System Project Managers
• Development Professionals
• Functional users of legacy information systems with domain knowledge
• IS developers know solutions but don’t know how and where to modify
Data Quality Management Process
Scope Data Quality Projects & Develop Implementation Plans
• Task Summary: Project goals, scope, and potential benefits
• Task Description: Describe data quality analysis tasks
• Project Approach: Summarize tasks and tools used to provide a baseline of existing data quality
• Schedule: Identify task start, completion dates, and project milestones
• Resources: Include costs connected with tools acquisition, labor hours (by labor category), training, travel, and other direct and indirect costs
Data Quality Management Process
Implement Data Quality Projects (Define, Measure, Analyze, Improve)
• Plan / Define: Identify functional user DQ requirements and establish DQ metrics
• Do / Measure: Measure conformance to current business rules and develop exception reports
• Check / Analyze: Verify, validate, and assess poor DQ causes. Define improvement opportunities
• Act / Improve: Select/prioritize DQ improvement opportunities, i.e. data entry procedures, updating data validation rules, and/or company data standards.
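The Do / Measure step can be sketched as a set of named business rules applied to every record, with each violation written to an exception report. The rules, field names, and sample records below are hypothetical and exist only to show the mechanics.

```python
def exception_report(records, rules):
    """Return (record index, rule name) for every business rule a record violates."""
    report = []
    for index, record in enumerate(records):
        for rule_name, check in rules.items():
            if not check(record):
                report.append((index, rule_name))
    return report

def conformance(records, rules):
    """Fraction of records that satisfy all rules: a simple DQ metric."""
    failing = {index for index, _ in exception_report(records, rules)}
    return 1 - len(failing) / len(records) if records else 1.0

# Hypothetical business rules for an employee table.
rules = {
    "emp_id present": lambda r: r.get("emp_id") is not None,
    "age in range": lambda r: 18 <= r["age"] <= 65,
    "gender code valid": lambda r: r["gender"] in {"Male", "Female"},
}
employees = [
    {"emp_id": 440, "age": 35, "gender": "Male"},
    {"emp_id": None, "age": 35, "gender": "Male"},
    {"emp_id": 441, "age": 350, "gender": "M"},
]
report = exception_report(employees, rules)
score = conformance(employees, rules)
```

The report feeds the Check / Analyze step, where the causes behind each violated rule are investigated.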
Data Quality Management Process
Evaluate Data Quality Management Methods
• Modifying existing methods of DQ management
• Determining whether DQ projects have helped to achieve demonstrable goals and benefits
• Evaluating and assessing DQ work, as it is not a program but a new way of doing business
How to improve Data Quality?
The four categories of Data Quality Improvement
• Process
• System
• Policy & Procedure
• Data Design
Quality Management Maturity Grid
CMM Level-1: Uncertainty
CMM Level-2: Awakening
CMM Level-3: Enlightenment
CMM Level-4: Wisdom
CMM Level-5: Certainty
Misconceptions on Data Quality
You Can Fix Data
• Problem NOT in data, but how it was used.
• It is NOT a one time process.
• Buying a cleansing tool is NOT the solution.
• Some live with the problem, can't afford the tool.
Data Quality is an IT Problem
• It is the company problem.
• Define the metrics of quality.
• Business has to strike a balance between quality and ROI.
• Joint business and IT effort.
Misconceptions on Data Quality
(All) Problem is in the Data Sources or Data Entry
• NOT the only problem.
• Systems could be responsible, but actually it is the metrics.
• Two divisions using different codes for same entity.
• Need to track, trace, check data from creation to usage.
The Data Warehouse will provide a single source of truth
• In an ideal world it is indeed true.
• In the real world there may be multiple data warehouses, data marts, external sources, i.e. silos of data resulting in multiple sources of “truth”.
• Even with a single source of truth, if transformations and interpretations are different, it is an issue.