raj-endran
Data Mining: Applications, Issues
APPLICATIONS OF DATA MINING
ISSUES IN DATA MINING
Applications:
  - Financial Data Analysis
  - Retail Industry
  - Telecommunication Industry
  - Biological Data Analysis
  - Other Scientific Applications
  - Intrusion Detection
Financial Data Analysis:
Financial data
  - Collected from banks and financial institutions
  - Usually complete and reliable
Design and construction of data warehouses for multidimensional data analysis and mining
  - Analyze changes by month, by region, and by sector, in terms of max, min, total, average, trend, etc.
  - Characteristic and comparative analysis, outlier analysis
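
The multidimensional summaries listed above (changes by month, region, or sector, reported as max, min, total, and average) can be sketched in plain Python. The records, the field names, and the `roll_up` helper are hypothetical and only illustrate the idea; a real warehouse would answer such queries with its own OLAP engine.

```python
from collections import defaultdict

# Hypothetical transaction records; all field names are assumptions.
transactions = [
    {"month": "2024-01", "region": "North", "sector": "Retail", "amount": 120.0},
    {"month": "2024-01", "region": "South", "sector": "Retail", "amount": 80.0},
    {"month": "2024-02", "region": "North", "sector": "Energy", "amount": 200.0},
    {"month": "2024-02", "region": "North", "sector": "Retail", "amount": 100.0},
]

def roll_up(records, dimensions, measure="amount"):
    """Group records by the chosen dimensions and summarize the measure."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        groups[key].append(rec[measure])
    return {
        key: {
            "min": min(values),
            "max": max(values),
            "total": sum(values),
            "average": sum(values) / len(values),
        }
        for key, values in groups.items()
    }

by_region = roll_up(transactions, ["region"])
by_month_region = roll_up(transactions, ["month", "region"])
```

Passing a different list of dimensions to `roll_up` gives a coarser or finer view of the same data, which is the essence of moving between levels of a data cube.
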
Loan payment and customer credit policy analysis
  - Feature selection and attribute relevance ranking (debt ratio, credit history, income, education level)
  - Loan granting policy can be adjusted
  - Low-risk customers are granted loans
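
One possible way to rank attribute relevance is information gain with respect to the repayment outcome. The slides do not name a specific measure, so the choice of information gain, the toy loan records, and the attribute values below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attribute]].append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder

# Hypothetical, already discretized loan records.
loans = [
    {"debt_ratio": "high", "credit_history": "bad", "repaid": "no"},
    {"debt_ratio": "high", "credit_history": "good", "repaid": "yes"},
    {"debt_ratio": "low", "credit_history": "bad", "repaid": "no"},
    {"debt_ratio": "low", "credit_history": "good", "repaid": "yes"},
]

# Most relevant attribute first.
ranking = sorted(
    ["debt_ratio", "credit_history"],
    key=lambda a: information_gain(loans, a, "repaid"),
    reverse=True,
)
```

In this toy data credit history separates the outcomes perfectly while debt ratio carries no information, so credit history ranks first.
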
Classification and clustering of customers for targeted marketing
  - Customer group identification
  - Multidimensional clustering techniques
  - New customers can be associated with existing groups
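
Associating a new customer with an existing group can be sketched as a nearest-centroid lookup, assuming the groups were already found by some clustering step. The group names, the two features, and the centroid values are hypothetical.

```python
import math

# Hypothetical centroids of existing customer groups:
# (annual income in thousands, purchases per year).
group_centroids = {
    "budget": (30.0, 5.0),
    "regular": (60.0, 20.0),
    "premium": (120.0, 40.0),
}

def nearest_group(customer, centroids):
    """Return the name of the group whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda name: math.dist(customer, centroids[name]))

new_customer = (65.0, 18.0)
assigned = nearest_group(new_customer, group_centroids)
```

In practice the features would usually be scaled to comparable ranges first, since otherwise the attribute with the largest numeric range dominates the distance.
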
Detection of money laundering and financial crimes
  - Data from several sources is integrated
  - Data analysis tools can be used to detect unusual patterns
  - Data visualization tools, linkage analysis tools, classification tools, clustering tools, outlier analysis tools
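
As a minimal sketch of outlier analysis on transaction amounts, the snippet below flags values whose modified z-score (based on the median and the median absolute deviation) exceeds a threshold. The method, the 3.5 threshold, and the sample amounts are assumptions; the slides only say that outlier analysis tools are used.

```python
import statistics

def flag_outliers(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds the threshold."""
    median = statistics.median(amounts)
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [a for a in amounts if 0.6745 * abs(a - median) / mad > threshold]

# Hypothetical transfer amounts for one account.
amounts = [120, 95, 130, 110, 105, 9800, 115]
suspicious = flag_outliers(amounts)
```

The median-based score is used here because a single very large transfer would inflate a mean and standard deviation and could hide itself.
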
Retail Industry:
Sales data, customer shopping history, goods transportation, e-commerce
Mining can help to
  - Identify buying behaviour and discover shopping trends
  - Improve the quality of customer service and retain customers
Design and construction of data warehouses
  - Several ways to design a warehouse
  - Entities involved: sales, customers, employees, goods transportation
  - Preliminary data mining exercises can help to guide the design process: which dimensions and levels to involve and what preprocessing to do
Multidimensional analysis of sales, customers, products, time, and region
  - Multi-feature data cube
  - Visualization tools
Analysis of the effectiveness of sales campaigns
  - Compare sales and transaction volume
  - Multidimensional analysis: compare sales amounts and the number of transactions containing the same items before and after the campaign
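
The before-and-after comparison can be sketched as follows. The transactions, the promoted item, and the campaign start date are hypothetical, and `campaign_summary` is an illustrative helper rather than part of any retail system.

```python
from datetime import date

# Hypothetical campaign parameters and transaction log.
campaign_start = date(2024, 3, 1)
promoted_item = "coffee"

transactions = [
    {"date": date(2024, 2, 10), "items": {"coffee", "milk"}, "amount": 12.0},
    {"date": date(2024, 2, 20), "items": {"bread"}, "amount": 3.0},
    {"date": date(2024, 3, 5), "items": {"coffee"}, "amount": 8.0},
    {"date": date(2024, 3, 9), "items": {"coffee", "sugar"}, "amount": 11.0},
    {"date": date(2024, 3, 15), "items": {"tea"}, "amount": 4.0},
]

def campaign_summary(records, item, start):
    """Count transactions containing the item, and their sales amount,
    before and after the campaign start date."""
    summary = {
        "before": {"count": 0, "amount": 0.0},
        "after": {"count": 0, "amount": 0.0},
    }
    for rec in records:
        if item not in rec["items"]:
            continue
        period = "before" if rec["date"] < start else "after"
        summary[period]["count"] += 1
        summary[period]["amount"] += rec["amount"]
    return summary

summary = campaign_summary(transactions, promoted_item, campaign_start)
```

For a fair comparison the two periods should cover windows of equal length, and seasonal effects should be considered before attributing a change to the campaign.
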
Association analysis
  - Identify items likely to be purchased together
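
A minimal sketch of association analysis on item pairs computes support (the share of baskets containing both items) and confidence (the share of baskets with the first item that also contain the second). The baskets and the 0.5 support threshold are made up for illustration, and only pairs are considered, not larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        frozenset(p) for b in baskets for p in combinations(sorted(b), 2)
    )
    rules = []
    for pair, count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        a, b = sorted(pair)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(baskets)
```

Here every basket containing milk also contains butter, so the rule from milk to butter has confidence 1.0, while the reverse rule is weaker.
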
Customer retention
  - Customer loyalty and trends
  - Sequential pattern mining
  - Adjust pricing strategy and goods range
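
Sequential pattern mining looks for purchases that tend to follow one another over time. The sketch below handles only the simplest case, ordered pairs ("A bought, then later B"), and counts each pair once per customer. The purchase histories and the support threshold are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories, each in chronological order.
histories = {
    "c1": ["phone", "case", "charger"],
    "c2": ["phone", "charger"],
    "c3": ["case", "phone"],
}

def sequential_pairs(histories, min_support=2):
    """Find ordered pairs (first, later) seen for at least min_support customers."""
    support = Counter()
    for purchases in histories.values():
        seen = set()
        for i, first in enumerate(purchases):
            for later in purchases[i + 1:]:
                if first != later:
                    seen.add((first, later))
        support.update(seen)  # each customer contributes at most once per pair
    return {pair: n for pair, n in support.items() if n >= min_support}

patterns = sequential_pairs(histories)
```

A pattern such as phone followed by charger could then inform which follow-up offers to show after a purchase.
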
Purchase recommendation and cross-reference of items
  - Recommender systems
  - Sales promotion by displaying deal information in association with items of interest
Telecommunication Industry:
Computer and Web data transmission, fax, mobile phone, telephone services
Multidimensional analysis of telecommunication data
  - Helps to identify and compare data traffic, system workload, resource usage, user group behavior, profit, etc.
  - Time-of-day usage patterns
Fraudulent pattern analysis
  - Identify fraudulent users and atypical usage patterns
  - Illegal customer account access
  - Automatic dial-out equipment
  - Switch and route congestion patterns
Multidimensional association and sequential pattern analysis
  - Usage patterns for a set of communication services by customer group and time of day
  - Sales promotion
Mobile telecommunication services
  - Spatio-temporal data mining
Use of visualization tools
Biomedical and DNA Data Analysis:
Research in DNA analysis has led to
  - Development of new drugs
  - Cancer therapies
  - Human genome study
  - Discovery of genetic causes for many diseases
Genome research
  - Study of DNA sequences
  - Nucleotides: adenine, cytosine, guanine, thymine
  - 100,000 genes, each with hundreds of nucleotides that can be combined in a number of ways
  - Identifying gene sequence patterns is challenging
Semantic integration of heterogeneous, distributed genome databases
  - Highly distributed generation and use of DNA data
  - Integrated data warehouses and distributed federated databases
  - Efficient data cleaning and integration methods
Similarity search and comparison among DNA sequences
  - Gene sequences isolated from healthy and diseased tissues
  - Compare frequently occurring patterns in each class
  - Helps to identify the genetic factors of the disease and immune factors
  - Non-numeric nature of the data poses difficulties
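
Because DNA data is non-numeric, one simple way to compare frequently occurring patterns between the two classes is to count short substrings (k-mers) in each and look for those common in diseased samples but absent from healthy ones. The sequences, the value of k, and the count threshold below are invented for illustration; real analyses rely on alignment tools and proper statistics.

```python
from collections import Counter

def kmer_counts(sequences, k=3):
    """Count every length-k substring across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def disease_specific_kmers(diseased, healthy, k=3, min_count=2):
    """k-mers seen at least min_count times in diseased and never in healthy."""
    d_counts = kmer_counts(diseased, k)
    h_counts = kmer_counts(healthy, k)
    return sorted(
        kmer for kmer, n in d_counts.items()
        if n >= min_count and kmer not in h_counts
    )

# Hypothetical sequence fragments.
healthy = ["ACGTACGT", "ACGTTGCA"]
diseased = ["ACGGACGG", "TTGGACGG"]
candidates = disease_specific_kmers(diseased, healthy)
```

The returned k-mers are only candidates for closer study; a pattern that differs between two small samples is not by itself evidence of a genetic factor.
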
Association analysis: identification of co-occurring gene sequences
  - Diseases triggered by a combination of genes acting together
  - Association analysis helps to detect the kinds of genes that may co-occur
  - Study interactions and relationships between them
Path analysis: linking genes to different stages of disease development
  - Different genes become active at different stages of the disease
  - Develop drug interventions that target specific stages
Visualization tools and genetic data analysis
  - Complex gene structures
  - Graphs, trees, cuboids, and visualization tools
  - Better understanding and support for interactive data exploration
Intrusion Detection:
Intrusions
  - Any set of actions that threaten the integrity, availability, or confidentiality of a network resource
Misuse detection: use patterns of well-known attacks to identify intrusions
  - Signatures must be updated
  - Classification based on known intrusions
  - E.g., three consecutive login failures: password guessing
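
The password-guessing signature mentioned above (three consecutive login failures) can be written as a small rule over a time-ordered event log. The event format and the user names are assumptions.

```python
def detect_password_guessing(events, max_failures=3):
    """Flag users who reach max_failures consecutive failed logins.

    events: iterable of (user, outcome) tuples in time order,
    where outcome is "fail" or "ok" (an assumed log format).
    """
    streak = {}
    flagged = []
    for user, outcome in events:
        if outcome == "fail":
            streak[user] = streak.get(user, 0) + 1
            if streak[user] == max_failures:
                flagged.append(user)
        else:
            streak[user] = 0  # a successful login resets the streak
    return flagged

# Hypothetical login log.
events = [
    ("alice", "fail"), ("bob", "fail"), ("alice", "fail"),
    ("bob", "ok"), ("alice", "fail"), ("bob", "fail"),
]
alerts = detect_password_guessing(events)
```

A rule like this catches only the attack it encodes, which is why misuse detection depends on keeping the signatures up to date.
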
Anomaly detection: use deviation from normal usage patterns to identify intrusions
  - Any significant deviations from the expected behavior are reported as possible attacks
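
A minimal sketch of anomaly detection builds a profile of normal behavior from historical values and reports observations that deviate strongly from it. The metric (logins per hour), the sample values, and the three-standard-deviation cutoff are assumptions for illustration.

```python
import statistics

def build_profile(normal_values):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, max_sigma=3.0):
    """Report a value that lies more than max_sigma deviations from the mean."""
    mean, stdev = profile
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > max_sigma

# Hypothetical logins per hour observed during normal operation.
normal_logins = [10, 12, 11, 9, 13, 10, 12]
profile = build_profile(normal_logins)
```

With this profile, 12 logins in an hour falls within the normal range, while 40 would be reported as a possible attack. Unlike a signature, this approach can catch unknown attacks, at the cost of false alarms when legitimate behavior changes.
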
Data mining algorithms for misuse detection
  - Training data labeled normal / intrusion
  - A classifier can be used to detect known intrusions
  - Classification algorithms, association rule mining
Data mining algorithms for anomaly detection
  - Build models of normal behavior and detect significant deviations
  - Supervised: normal training data
  - Unsupervised: no information about the training data
  - Classification, clustering
Association and correlation analysis
  - Finds relationships between system attributes describing the network data
  - Helps in the selection of useful attributes
Analysis of stream data
  - Transient and dynamic nature of intrusions
  - An event may be normal on its own but malicious when viewed as part of a sequence
Distributed data mining
  - Analysis of data from several locations
Visualization and querying tools
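
The stream-data point, that a single event can look normal while the sequence is malicious, can be illustrated with a sliding time window. In this sketch one connection attempt is harmless, but one source touching several distinct ports within a short window is flagged as a possible scan. The event format, addresses, window length, and port threshold are all hypothetical.

```python
from collections import deque

def scan_sources(events, window=10, min_ports=3):
    """Flag sources that contact min_ports distinct ports within the window.

    events: iterable of (timestamp_seconds, source, port) in time order.
    """
    recent = deque()
    flagged = set()
    for ts, src, port in events:
        recent.append((ts, src, port))
        # Drop events that have fallen out of the time window.
        while recent and recent[0][0] <= ts - window:
            recent.popleft()
        ports = {p for _, s, p in recent if s == src}
        if len(ports) >= min_ports:
            flagged.add(src)
    return flagged

# Hypothetical connection events.
events = [
    (0, "10.0.0.5", 22), (2, "10.0.0.9", 80), (3, "10.0.0.5", 23),
    (5, "10.0.0.5", 80), (30, "10.0.0.9", 443), (45, "10.0.0.9", 22),
]
suspects = scan_sources(events)
```

Only the bounded window is kept in memory, which matters for streams that are too large to store and revisit.
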
Data Mining in Other Scientific Applications:
Old scenario
  - Small, homogeneous data sets
  - Formulate hypothesis, build model, evaluate results
Current scenario
  - High-dimensional data, stream data, heterogeneous data (spatial, temporal)
  - Collect and store data, mine for new hypotheses, confirm with data or experimentation
Vast amounts of data have been collected from scientific domains
  - Climate and ecosystem modeling, chemical engineering, fluid dynamics, structural mechanics
Data warehouses and data preprocessing
  - In scientific applications, methods are needed for integrating data from heterogeneous sources (geospatial data warehouse) and for identifying events (climate and ecosystem data)
Mining complex data types
  - Scientific data: semi-structured and unstructured
  - Multimedia and spatial data
Graph-based mining
  - Labeled graphs capture spatial, topological, geometric, and other relational characteristics present in scientific data
  - Nodes: objects to be mined; edges: relationships
  - Scalable and efficient mining methods are needed
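
As a small illustration of the labeled-graph representation, the sketch below stores each graph as node labels plus edges and counts which labeled edge patterns occur in several graphs. This covers only single-edge patterns, the first step of frequent subgraph mining, and the graphs and labels are invented for illustration.

```python
from collections import Counter

# Each hypothetical graph has node labels and undirected edges between node ids.
graphs = [
    {"labels": {1: "C", 2: "O", 3: "H"}, "edges": [(1, 2), (1, 3)]},
    {"labels": {1: "C", 2: "O", 3: "N"}, "edges": [(1, 2), (2, 3)]},
    {"labels": {1: "C", 2: "H"}, "edges": [(1, 2)]},
]

def frequent_edge_patterns(graphs, min_support=2):
    """Return label pairs joined by an edge in at least min_support graphs."""
    support = Counter()
    for g in graphs:
        # A set, so each pattern counts once per graph regardless of repeats.
        patterns = {
            tuple(sorted((g["labels"][u], g["labels"][v])))
            for u, v in g["edges"]
        }
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}

frequent = frequent_edge_patterns(graphs)
```

Growing these single edges into larger frequent subgraphs is where the cost rises sharply, which is why scalable methods are called for.
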
Visualization tools and domain-specific knowledge
  - High-level GUIs and visualization tools are needed
  - Integrated with existing domain-specific systems and database systems
Issues in Data Mining:
Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation
Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, and incremental mining methods
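
An incremental mining method updates its results as new data arrives instead of rescanning everything. The sketch below keeps running item counts so that support values stay current after each batch; the class name, its methods, and the sample batches are hypothetical.

```python
from collections import Counter

class IncrementalItemCounts:
    """Maintain item support without re-reading earlier batches."""

    def __init__(self):
        self.n = 0               # number of baskets seen so far
        self.counts = Counter()  # number of baskets containing each item

    def update(self, batch):
        """Fold a new batch of baskets into the running counts."""
        for basket in batch:
            self.n += 1
            self.counts.update(set(basket))

    def support(self, item):
        """Share of all baskets seen so far that contain the item."""
        return self.counts[item] / self.n if self.n else 0.0

miner = IncrementalItemCounts()
miner.update([{"bread", "milk"}, {"bread"}])
miner.update([{"milk"}, {"bread", "jam"}])
```

The same counters could be kept separately on several machines and added together, which hints at how incremental and distributed methods relate.
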