
Another Look at Data Mining

Why do we mine?

What do we mine?

How do we mine?


CS753 Dr. Mary Ann Robbert

What is Data Mining

➤ Data mining discovers meaningful new correlations, hidden patterns and relationships in your data

➤ Conceptual descendant of statistics

➤ Combines machine learning, statistics, and databases

➤ Knowledge discovery: process of building and implementing a data mining solution


Data Mining Overview

➤ Knowledge Discovery in Databases (KDD)

➤ No one data mining approach
  ➤ each tool viewed logically as application of client
  ➤ Can reside on separate machine or in separate process and access data warehouse
  ➤ RDBMS or proprietary OLAP embed data mining capabilities deeply within engines to improve efficiency and add extensions

➤ Requires a good foundation in terms of a data warehouse


Data Mining Overview (cont'd)

➤ Common algorithmic approaches
  ➤ association, affinity grouping
  ➤ predicting, sequence-based analysis
  ➤ clustering
  ➤ classification
  ➤ estimation

➤ Steps are: data selection, data transformation, data mining, result interpretation.
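
As one illustration of the association (affinity grouping) approach listed above, the sketch below counts which items occur together in a set of transactions and derives support and confidence for a single rule. The basket data, the item names and the `rule_stats` helper are invented for illustration; they are not part of the course material.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

# Count co-occurring pairs to see which items tend to group together.
pair_counts = Counter(
    pair for t in baskets for pair in combinations(sorted(t), 2)
)

support, confidence = rule_stats(baskets, "bread", "milk")
```

Support measures how common the pair is overall, while confidence measures how often the consequent appears given the antecedent; practical tools usually filter rules on both.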


Strategic Benefit of Data Mining

➤ Direct Marketing

➤ Trend Analysis

➤ Fraud detection

➤ Forecasting in Financial Markets


Why Data Mining Now?

➤ Economics
  ➤ Unprecedented affordability of MIPS and MB

➤ Parallel computing
  ➤ Enormous amounts of data can be processed

➤ Popularity of data warehouses, data marts
  ➤ Relatively clean data available


Data Mining compared to Traditional Analysis

➤ Traditional Analysis
  ➤ Did sales of product X increase in Nov.?
  ➤ Do sales of product X decrease when there is a promotion on product Y?

➤ Data mining is result oriented
  ➤ What are the factors that determine sales of product X?


Data Mining compared to Traditional Analysis (cont'd)

➤ Traditional analysis is incremental
  ➤ Does billing level affect turnover?
  ➤ Does location affect turnover?
  ➤ Analyst builds model step by step

➤ Data Mining is result oriented
  ➤ Identify the factors and predict turnover


Steps in Data Mining

➤ Data Manipulation - can be 70-80% of data mining effort
  ➤ data cleaning
  ➤ missing values
  ➤ data derivation
  ➤ merging data

➤ Defining a study
  ➤ Supervised - articulating goal, choosing dependent variable or output and specifying data fields
  ➤ Unsupervised - group similar types of data or identify exceptions
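
The four data manipulation tasks above can be sketched in a few lines. This is a minimal illustration on made-up records: the field names, the mean-fill strategy for missing ages and the derived `is_buyer` field are all assumptions, and real projects choose these per data set.

```python
from statistics import mean

# Hypothetical customer and order records; None marks a missing value.
customers = [
    {"id": 1, "name": " Alice ", "age": 34},
    {"id": 2, "name": "BOB", "age": None},
    {"id": 3, "name": "carol", "age": 52},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 40.0},
]

# Data cleaning: normalize whitespace and capitalization in names.
for c in customers:
    c["name"] = c["name"].strip().title()

# Missing values: fill unknown ages with the mean of the known ages.
known_ages = [c["age"] for c in customers if c["age"] is not None]
mean_age = mean(known_ages)
for c in customers:
    if c["age"] is None:
        c["age"] = mean_age

# Merging data: total order amount per customer, joined on the customer id.
totals = {}
for o in orders:
    totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]

# Data derivation: add fields that were not present in either source.
for c in customers:
    c["total_spend"] = totals.get(c["id"], 0.0)
    c["is_buyer"] = c["total_spend"] > 0
```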


Steps in Data Mining (cont'd)

➤ Reading the data and building the model
  ➤ model summarizes large amounts of data by accumulating indicators (frequencies, weight, conjunctions, differentiation)

➤ Understanding the model
  ➤ Know the particular model

➤ Prediction
  ➤ Choose the best outcome based on historical data
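
A very small example of a model that "accumulates indicators" is a frequency table keyed on a conjunction of attribute values; prediction then picks the outcome seen most often in the historical data. The records, attribute names and the `build_model`/`predict` helpers below are hypothetical, and real tools use far richer models than this.

```python
from collections import Counter, defaultdict

# Hypothetical historical records: (billing_level, location, outcome).
history = [
    ("high", "urban", "stays"),
    ("high", "rural", "leaves"),
    ("low", "urban", "stays"),
    ("high", "urban", "stays"),
    ("low", "rural", "leaves"),
    ("low", "rural", "leaves"),
]

def build_model(records):
    """Accumulate outcome frequencies for each (billing_level, location)."""
    model = defaultdict(Counter)
    for billing, location, outcome in records:
        model[(billing, location)][outcome] += 1
    return model

def predict(model, billing, location, default="unknown"):
    """Choose the outcome seen most often for this combination."""
    counts = model.get((billing, location))
    if not counts:
        return default
    return counts.most_common(1)[0][0]

model = build_model(history)
prediction = predict(model, "low", "rural")
```

Because the model is just a table of counts, it is easy to inspect, which is one way of "understanding the model" before trusting its predictions.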


Models

➤ Genetic Algorithms

➤ Neural Nets

➤ Agents

➤ Statistics

➤ Visualization


Genetic Algorithms

➤ Artificial intelligence system that mimics the evolutionary, survival-of-the-fittest processes to generate increasingly better solutions to a problem.

➤ Genetic algorithms produce several generations of solutions, choosing the best of the current set for each new generation.

➤ Examples
  ➤ Generating human faces based on a few known features.
  ➤ Generating solutions to routing problems.
  ➤ Generating stock portfolios.


EVOLUTION IN GENETIC ALGORITHMS

➤ SELECTION - or survival of the fittest. The key is to give preference to better outcomes.

➤ CROSSOVER - combining portions of good outcomes in the hope of creating an even better outcome.

➤ MUTATION - randomly trying combinations and evaluating the success (or failure) of the outcome.
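
The three operators can be seen working together in a toy genetic algorithm. The sketch below is an illustration only: the objective (maximize the number of 1 bits in a string), the parameter values and the `evolve` function are assumptions chosen for brevity, and carrying the best individual forward unchanged (elitism) is an added design choice, not something the slides specify.

```python
import random

def fitness(bits):
    """Toy objective (an assumption): the number of 1 bits."""
    return sum(bits)

def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # SELECTION: rank by fitness and keep the better half as parents.
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: pop_size // 2]
        # Elitism (added assumption): copy the current best forward unchanged.
        next_gen = [population[0][:]]
        while len(next_gen) < pop_size:
            mother, father = rng.sample(parents, 2)
            # CROSSOVER: join the head of one parent to the tail of the other.
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]
            # MUTATION: occasionally flip a bit at random.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    population.sort(key=fitness, reverse=True)
    best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation

best, trace = evolve()
```

`trace` records the best fitness in each generation; with elitism it can never decrease, which makes the "increasingly better solutions" behaviour easy to observe.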


Neural Nets

➤ Mathematical Model of the Way a Brain Functions

➤ Machine learning approach by which historical data can be examined for pattern recognition

➤ A neural network simulates the human ability to classify things based on the experience of seeing many examples.

➤ Pros - Numerical Data

➤ Cons - Opaque, Art or Science

http://www.attar.com/


➤ Examples
  ➤ Distinguishing different chemical compounds
  ➤ Detecting anomalies in human tissue that may signify disease
  ➤ Reading handwriting
  ➤ Detecting fraud in credit card use
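
To make "learning to classify from examples" concrete, here is a sketch of a single artificial neuron (a perceptron) trained on labeled examples. It is far simpler than the networks used for the applications above, and the training data (the logical AND of two inputs), the learning rate and the function names are assumptions made for illustration.

```python
def classify(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(examples, epochs=20, rate=1):
    """Learn weights for one neuron from (inputs, label) examples."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - classify(weights, bias, inputs)
            # Nudge the weights toward the correct answer when the guess is wrong.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Hypothetical training set: output is 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(examples)
predictions = [classify(weights, bias, x) for x, _ in examples]
```

The learned weights are just numbers with no obvious meaning, a small-scale view of why the slide lists "opaque" as a drawback.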


Intelligent Agents

➤ Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy and employ some knowledge or representation of the user's goals and desires.

➤ Some common characteristics
  ➤ ability to communicate, cooperate and coordinate with other agents
  ➤ ability to act autonomously to achieve collective goal of system


Intelligent Agents (cont'd)

➤ Tasks
  ➤ automate repetitive tasks
  ➤ finding and filtering information
  ➤ summarizing complex data

➤ Capability to learn and make recommendations

➤ Black box approach hides complexity and allows for design of scalable system


Comparison

AI System            Problem Type                                Based On                    Starting Information
Expert Systems       Diagnostic or prescriptive                  Strategies of experts       Expert’s know-how
Neural Networks      Identification, classification, prediction  The human brain             Acceptable patterns
Genetic Algorithms   Optimal solution                            Biological evolution        Set of possible solutions
Intelligent Agents   Specific and repetitive tasks               One or more AI techniques   Your preferences


Statistics

➤ SAS, SPSS

➤ Pros - Established technology

➤ Cons - Needs assumptions, nominal variable handling, management acceptance?


Visualization

➤ Data visualization refers to technologies that support visualization of information

➤ Includes: digital images, GIS, multi-dimensions, 3-D presentations, animations

➤ http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html


Data Mining is Not a Silver Bullet

➤ It does not:
  ➤ Find answers to questions you don’t ask
  ➤ Eliminate the need for domain experience
  ➤ Remove the need for data analysis skills


Data Mining Software

➤ http://www.kdnuggets.com/software/

➤ http://www.attar.com/ download

➤ http://www.cs.bham.ac.uk/~anp/software.html software listing


Six Rules of Data Quality
by Ken Orr

1. Data that is not used cannot be correct for very long

2. Data quality in an information system is a function of its use, not its collection

3. Data quality will ultimately be no better than its most stringent use

4. Data quality problems tend to become worse with the age of the system

5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change.

6. Information overload affects data quality


Data Quality Software

➤ http://www.rulequest.com/gritbot-info.html


General DW Data Transformation

➤ Resolve inconsistent legacy formats

➤ Strip out unwanted fields

➤ Interpret codes into text

➤ Combine data from multiple sources under a common key

➤ Find fields used for multiple purposes and interpret each field's value based on context
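
Several of these warehouse transformations can be sketched on two made-up source extracts. Everything below (the row layouts, the `STATUS_TEXT` lookup, the two date formats and the `parse_date` helper) is hypothetical; it only illustrates the first four bullets and does not attempt the context-dependent interpretation of multi-purpose fields.

```python
from datetime import datetime

# Hypothetical lookup table and legacy extracts.
STATUS_TEXT = {"A": "Active", "C": "Closed"}
billing_rows = [
    {"cust": "0042", "opened": "03/15/1998", "status": "A", "filler": "xx"},
]
support_rows = [
    {"customer_id": 42, "opened": "1998-03-15", "tickets": 3},
]

def parse_date(text):
    """Resolve inconsistent legacy date formats into one date type."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

warehouse = {}
for row in billing_rows:
    key = int(row["cust"])  # common key shared by both sources
    warehouse[key] = {
        "opened": parse_date(row["opened"]),
        # Interpret the status code into readable text.
        "status": STATUS_TEXT.get(row["status"], "Unknown"),
        # The unwanted "filler" field is stripped by not copying it.
    }
for row in support_rows:
    record = warehouse.setdefault(row["customer_id"], {})
    record.setdefault("opened", parse_date(row["opened"]))
    record["tickets"] = row["tickets"]
```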


Data Transformation for Data Mining

➤ Flag normal, abnormal, out of bounds or impossible facts

➤ Recognize random or noise values from context and mask out

➤ Apply uniform treatment to NULL values

➤ Flag fact records with changed status

➤ Classify individual record by one of its aggregates
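
A rough sketch of three of these mining-specific transformations follows: uniform NULL handling, flagging impossible values, and labeling each record by an aggregate it belongs to. The plausible age range, the set of NULL markers, the spend threshold and all record layouts are assumptions for illustration.

```python
# Hypothetical settings; the plausible range and markers are assumptions.
AGE_RANGE = (0, 120)
NULL_MARKERS = {None, "", "N/A", "?"}

def prepare(record):
    """Return a copy of the record with a uniform NULL and a quality flag."""
    raw = record.get("age")
    # Apply uniform treatment to NULL values: map every marker to None.
    age = None if raw in NULL_MARKERS else int(raw)
    if age is None:
        flag = "missing"
    elif age < AGE_RANGE[0] or age > AGE_RANGE[1]:
        flag = "impossible"
    else:
        flag = "normal"
    return {**record, "age": age, "age_flag": flag}

rows = [{"id": 1, "age": "37"}, {"id": 2, "age": "N/A"}, {"id": 3, "age": "-5"}]
prepared = [prepare(r) for r in rows]

# Classify each individual record by one of its aggregates: tag every
# order with a tier based on its customer's total spend.
orders = [("ann", 30.0), ("ann", 90.0), ("bob", 10.0)]
totals = {}
for customer, amount in orders:
    totals[customer] = totals.get(customer, 0.0) + amount
tiered = [(c, a, "high" if totals[c] >= 100 else "low") for c, a in orders]
```

Flagging rather than deleting suspect values keeps the original record available, so the mining tool or analyst can decide later whether to mask it out.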


Conclusion

➤ For successful data mining:
  ➤ data analysis and mining goals must be identified and formulated
  ➤ appropriate data must be selected, cleaned and prepared for queries and business analysis

➤ http://www.rulequest.com/cubist-examples.html#BOSTON

➤ http://www.almaden.ibm.com/cs/quest/