7/28/2019 DBMS Data Mining
Data Mining - Beer and Nappies
On Thursday nights people who buy diapers also tend to buy beer
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.
Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
Data mining
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.
These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees).
Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.
Data Mining Algorithm
Objective: Fit data to a model
  Descriptive
  Predictive
Preference: Technique to choose the best model
Search: Technique to search the data
  Query
Data Mining: Descriptive
Identify and describe groups of customers with common buying behavior.
Data Mining: Predictive
Given a customer's characteristics, a model predicts how much the customer will spend on the next catalog order.
Predicting the likelihood (probability) that a customer would respond to an offer.
Data Mining Models and Tasks
Data Mining
Association (purchasing a pen and purchasing paper)
Sequence or path analysis (birth of a child and purchasing diapers)
Classification (duct tape purchases and plastic sheeting purchases)
Clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences)
Forecasting (discovering patterns from which one can make reasonable predictions regarding future activities)
Data Mining
Data mining is knowledge discovery using a sophisticated blend of techniques from
  traditional statistics,
  artificial intelligence and
  computer graphics.
Data mining is the process of semi-automatically analyzing large databases to find interesting and useful patterns.
Data mining overlaps with machine learning, statistics, artificial intelligence and databases.
Goals of Data Mining
Explanatory: To explain some observed event or condition. (Why sales of the Maruti Swift have increased in Chennai.)
Confirmatory: To confirm a hypothesis. (Whether two-income families are more likely to buy family medical coverage than single-income families.)
Exploratory: To analyze data for new or unexpected relationships. (What spending patterns are likely to accompany credit card fraud.)
Issues in data mining
Data quality, which refers to the accuracy and completeness of the data being analyzed.
Interoperability of the data mining software and databases being used by different agencies.
Mission creep, the use of data for purposes other than those for which the data were originally collected.
Privacy.
Advanced forms of Data Mining
Web mining
Spatial mining
Temporal mining
Web mining
Crawlers
Robot (spider)
Focused crawler
PageRank
Backlinks
Personalization
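The slide only names PageRank and backlinks, so the following is a minimal sketch of the general idea rather than any particular search engine's implementation: a page ranks higher when many well-ranked pages link to it. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page `web` graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C has three backlinks (from A, B and D), D has none.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

In this toy graph the page with the most backlinks, C, ends up with the highest rank, and D, which nothing links to, ends up with the lowest.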
Spatial mining
Goal: data mining on spatial data
Spatial selection may involve specialized selection comparison operations:
  Near
  North, South, East, West
  Contained in
  Overlap/intersect
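The comparison operations listed above can be sketched for the simplest spatial objects, axis-aligned rectangles. This is an illustrative sketch only: the `Rect` type, the function names, the convention that y grows northward, and the use of center-to-center distance for "near" are assumptions, not part of any particular spatial database.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; x grows east and y grows north (an assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def near(a, b, max_dist):
    """Near: the centers are at most max_dist apart."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return hypot(ax - bx, ay - by) <= max_dist


def north_of(a, b):
    """North: a lies entirely north of b."""
    return a.ymin >= b.ymax


def contained_in(a, b):
    """Contained in: a lies completely inside b."""
    return (b.xmin <= a.xmin and a.xmax <= b.xmax
            and b.ymin <= a.ymin and a.ymax <= b.ymax)


def overlaps(a, b):
    """Overlap/intersect: a and b share some interior area."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)


# Hypothetical map objects.
city = Rect(0, 0, 10, 10)
park = Rect(2, 2, 4, 4)
lake = Rect(3, 3, 12, 5)
suburb = Rect(0, 11, 10, 15)
```

Here `contained_in(park, city)` holds, `lake` overlaps `city` without being contained in it, and `suburb` is north of `city`. South, east and west follow the same pattern as `north_of`.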
Temporal mining
Goal: data mining for temporal data
Time Series
Pattern Detection
Sequences
Temporal Association Rules
HR database
Temporal Database
Snapshot: Traditional database
Temporal: Multiple time points
Types of Database (Temporal)
Snapshot: No temporal support
Transaction Time: Supports time when transaction inserted data
  Timestamp
  Range
Valid Time: Supports time range when data values are valid
Bitemporal: Supports both transaction and valid time
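The difference between valid time and transaction time is easiest to see on a single record type. The sketch below assumes a hypothetical HR salary table; the `SalaryFact` fields, the sample rows, and the `salary_as_of` helper are invented for illustration and do not come from a specific DBMS.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalaryFact:
    employee: str
    salary: int
    valid_from: date   # valid time: start of the period the value holds in reality
    valid_to: date     # valid time: end of that period (exclusive)
    recorded_on: date  # transaction time: when the row was inserted


# Hypothetical history. The third row is a late correction of the 2018 salary.
history = [
    SalaryFact("asha", 50000, date(2018, 1, 1), date(2019, 1, 1), date(2018, 1, 5)),
    SalaryFact("asha", 55000, date(2019, 1, 1), date(9999, 12, 31), date(2019, 2, 10)),
    SalaryFact("asha", 52000, date(2018, 1, 1), date(2019, 1, 1), date(2019, 3, 1)),
]


def salary_as_of(rows, employee, valid_on, known_on):
    """Salary valid on `valid_on`, as the database knew it on `known_on`."""
    candidates = [
        r for r in rows
        if r.employee == employee
        and r.valid_from <= valid_on < r.valid_to
        and r.recorded_on <= known_on
    ]
    if not candidates:
        return None
    # The most recently recorded row supersedes earlier ones.
    return max(candidates, key=lambda r: r.recorded_on).salary
```

A snapshot database would keep only the latest row. A bitemporal query can ask both "what was the salary in mid-2018?" and "what did we believe it was at the end of 2018, before the correction was recorded?"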
Database Searching vs. Data Mining
Database searching
  Query: Well defined; SQL
  Data: Operational data
  Output: Precise; subset of database
Data mining
  Query: Poorly defined; no precise query language
  Data: Not operational data
  Output: Fuzzy; not a subset of database
Query Examples
Database
  Find all credit applicants with last name of Smith.
  Identify customers who have purchased more than $10,000 in the last month.
  Find all customers who have purchased milk.
Data Mining
  Find all credit applicants who are poor credit risks. (classification)
  Identify customers with similar buying habits. (clustering)
  Find all items which are frequently purchased with milk. (association rules)
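The last data mining query, items frequently purchased with milk, can be sketched with the two standard association-rule measures: support (the fraction of all baskets containing both items) and confidence (the fraction of milk baskets that also contain the other item). The baskets below are invented for illustration, and `pair_rules` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]


def pair_rules(baskets, item):
    """Return {other: (support, confidence)} for the rules item -> other."""
    total = len(baskets)
    item_count = sum(1 for basket in baskets if item in basket)
    if item_count == 0:
        return {}
    together = Counter()
    for basket in baskets:
        if item in basket:
            # Count every other item that shares a basket with `item`.
            for other in basket - {item}:
                together[other] += 1
    return {
        other: (count / total, count / item_count)
        for other, count in together.items()
    }


rules = pair_rules(baskets, "milk")
```

With these baskets, milk appears in four of five baskets and bread accompanies it in three, so the rule milk -> bread has support 0.6 and confidence 0.75. Unlike the database query "find all customers who have purchased milk", the output is not a subset of the stored rows but a derived pattern.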
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
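The five KDD steps can be read as a pipeline in which each stage feeds the next. The sketch below is one possible arrangement under heavy assumptions: the two data sources, their field names, and the trivial "total spend per customer" mining step are all hypothetical stand-ins for real sources and a real algorithm.

```python
# Hypothetical raw records from two sources with different layouts.
store_sales = [
    {"customer": "C1", "amount": "120.50"},
    {"customer": "C2", "amount": ""},  # missing value
]
web_sales = [
    {"cust_id": "c1", "total": 80.0},
    {"cust_id": "C3", "total": 40.0},
]


def select(store, web):
    """Selection: obtain data from various sources."""
    return [("store", r) for r in store] + [("web", r) for r in web]


def preprocess(tagged):
    """Preprocessing: cleanse data (here, drop rows with a missing amount)."""
    cleaned = []
    for source, rec in tagged:
        amount = rec["amount"] if source == "store" else rec["total"]
        if amount not in ("", None):
            cleaned.append((source, rec))
    return cleaned


def transform(tagged):
    """Transformation: convert both layouts to one common format."""
    rows = []
    for source, rec in tagged:
        if source == "store":
            rows.append({"customer": rec["customer"].upper(),
                         "amount": float(rec["amount"])})
        else:
            rows.append({"customer": rec["cust_id"].upper(),
                         "amount": float(rec["total"])})
    return rows


def mine(rows):
    """Data mining: stand-in step that totals spend per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def interpret(totals):
    """Interpretation/Evaluation: present results in a readable form."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{customer}: {total:.2f}" for customer, total in ranked]


report = interpret(mine(transform(preprocess(select(store_sales, web_sales)))))
```

For the sample data, `report` is `["C1: 200.50", "C3: 40.00"]`: the C2 row is dropped during cleansing, and the two C1 records are merged once their customer identifiers share a common format.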
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data.
Subject-oriented: Contains information regarding objects of interest for decision support: sales by region, by product, etc.
Integrated: Data are typically extracted from multiple, heterogeneous data sources (e.g., from sales, inventory, billing databases, etc.).
Time-variant: Contains historical data, with a longer horizon than the operational system.
Nonvolatile: Data is not (or rarely) directly updated.
Why build a data warehouse
Access to data from multiple sources, giving a comprehensive data collection
Separate transactional and analysis systems: improve query response time (without slowing down transaction processing)
Easy formulation of complex queries
Access to historical data (not in operational systems)
Improved data quality (fewer errors and missing values)
Data Warehouse Back-End Tools
Data extraction: Extract data from multiple, heterogeneous, and external sources
Data cleaning (scrubbing): Detect errors in the data and rectify them when possible
Data converting: Convert data from legacy or host format to warehouse format
Transforming: Sort, summarize, compute views, check integrity, and build indices
Refresh: Propagate the updates from the data sources to the warehouse
Database vs. Data Warehouse
Database                                   Data Warehouse
Application oriented (OLTP)                Subject oriented (OLAP)
Used to run business                       Used to analyze business
Clerical user                              Manager/analyst
Detailed data                              Summarized and refined
Current, up to date                        Historical data
Operational data                           Integrated data
Repetitive access by small transactions    Ad-hoc access using large queries
Fast response time (seconds)               Slow response time (minutes)
Read/update access                         Mostly read access (batch update)
Relational schema                          Star/snowflake schema
On-Line Analytical Processing (OLAP)
Front end to the data warehouse, allowing easy data manipulation
Allows conducting inquiries over the data at various levels of abstraction
Fast and easy because some aggregations are computed in advance
No need to formulate entire query
OLAP uses data in multidimensional format (e.g., data cubes) to simplify queries and shorten response time
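The idea of computing aggregations in advance can be sketched as a tiny data cube: for every fact, the sales figure is added to every combination in which each dimension is either kept or rolled up to "all" (written `"*"` here). The fact rows, the three dimensions (region, product, quarter), and the `build_cube` helper are assumptions made for illustration; real OLAP servers usually materialize only selected aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("North", "Pen", "Q1", 100),
    ("North", "Paper", "Q1", 150),
    ("South", "Pen", "Q1", 80),
    ("North", "Pen", "Q2", 120),
    ("South", "Paper", "Q2", 60),
]


def build_cube(facts):
    """Precompute total sales for every level of abstraction."""
    cube = defaultdict(int)
    for *dims, sales in facts:
        # Each bit of `mask` decides whether a dimension is kept or rolled up.
        for mask in range(2 ** len(dims)):
            key = tuple(
                value if (mask >> i) & 1 else "*"
                for i, value in enumerate(dims)
            )
            cube[key] += sales
    return dict(cube)


cube = build_cube(facts)
```

After the cube is built, an inquiry at any level is a single lookup rather than a scan: `cube[("North", "*", "*")]` gives all sales in the North region (370), and `cube[("*", "Pen", "*")]` gives all pen sales (300).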
Data Mining vs. Data Warehouse
Data Mining: Applications of methods (algorithms) to discover patterns in data.
  Include some OLAP operations
OLAP: Deductive process, testing existence of hypothetical patterns in data
  Good to explore the data and test hypotheses
Data Mining mostly refers to modeling underlying data
  Uncovering patterns in data
  Potentially surprising patterns may arise
Data Mining methods may use data from a data warehouse (when available)
Data Mining + Data Warehouse
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
MDDBMS
Multidimensional data model emerged over the past 10-15 years
MDDBMS is the Rubik's Cube of database management systems
Focuses on analyzing the data, not recording transactions
Data is categorized as either facts with numerical measures, or as dimensions that characterize the fact
MDDBMS
Takes data from many sources, such as RDBMS, legacy systems, etc.
Data is physically stored on disk in a data structure that is highly optimized for multidimensional processing and fast retrieval
Storage is between 2 and 10 times more efficient than in an RDBMS due to better indexing, compression and representation of sparse data
Benefits
Queries are simply a request to see pre-existing data organized in a specific fashion.
Already highly organized, so the requested data is removed and reorganized
Stores information in the same way that it is viewed (less data management and maintenance)
The drawbacks
Not the best solution for every problem
Works only on information with interrelations
Database explosion with large amounts of sparse data (calculating all relationships can increase the database size dramatically).
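A rough calculation shows why sparse data leads to database explosion. The sketch assumes a fully materialized cube in which every dimension also gets one extra "all" member for roll-ups; the dimension sizes and the fact count are hypothetical figures, and real systems mitigate the growth with compression and partial materialization.

```python
from math import prod


def cube_cells(cardinalities):
    """Cells in a fully materialized cube, with one 'all' member per dimension."""
    return prod(c + 1 for c in cardinalities)


def density(fact_count, cardinalities):
    """Fraction of base-level cells that actually hold a fact."""
    return fact_count / prod(cardinalities)


# Hypothetical retailer: 1000 products, 500 stores, 365 days, 2 million sales facts.
cardinalities = [1000, 500, 365]
fact_count = 2_000_000

total_cells = cube_cells(cardinalities)
fill = density(fact_count, cardinalities)
```

Under these assumptions the cube has 183,549,366 potential cells for 2,000,000 facts, so only about 1.1% of the base-level cells are occupied and a naive layout would reserve roughly 90 cells per stored fact.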
Example
MDDBMS are an important tool in KM,
Thank You