Chapter 1 Introduction to Data Mining

UNIT 1: DATA MINING

BOJAMMA KK LECTURER, DEPARTMENT OF BCA

CAUVERY DEGREE COLLEGE, GONIKOPPAL.

Introduction

Data mining is a set of methods applied to large and complex databases in order to filter out randomness and discover hidden patterns. Data mining tools, methodologies, and theories are used to reveal patterns in data.

Data mining has become an important area of study. In the 1960s, statisticians used the terms “Data Fishing” or “Data Dredging”. The term “Data Mining” appeared around 1990 in the database community.

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers.

Data Mining

• Data mining is the process of sorting through large data sets to identify patterns and

establish relationships to solve problems through data analysis, interpretation and

evaluation and finally retrieve knowledge from data.

• Data mining tools allow enterprises to predict future trends.

• Data mining techniques support automatic exploration of data. Data mining attempts to seek out patterns and trends in the data and to infer rules from those patterns.

• With these rules, the user will be able to support, review and examine decisions in some

related business or scientific area.

Data Mining Definitions:

1) Data mining is the search for the relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and their medical diagnosis. This relationship represents valuable knowledge about the database and the objects in the database.

2) Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.

KDD (Knowledge Discovery in Databases)

• Knowledge discovery in databases (KDD) is the process of discovering useful knowledge

from a collection of data.

• KDD was formalized in the year 1989.

• Data Mining is one among the many steps in the discovery of knowledge.

• The KDD process includes data selection and preparation, data cleansing, incorporation of prior knowledge about the data sets, and interpretation of accurate solutions from the observed results.

Stages of KDD

The stages of KDD start with raw data and finish with extracted knowledge.

1. Data Selection:

KDD cannot be carried out without human interaction. Choosing the data set, and the subset of it to analyze, requires knowledge of the domain from which the data is taken. Removing unrelated data elements from the data set reduces the search space during the data mining phase of KDD.

2. Pre-processing:

Databases often contain incorrect or missing data. During the pre-processing phase, the data is cleaned. This involves removing “outliers”, choosing approaches for handling missing data fields, accounting for time-sequence information, and normalizing the data where applicable.

3. Transformation:

The transformation phase attempts to reduce the number of data elements that must be assessed while preserving the quality of the information. During this stage, data is organized, converted from one type to another (e.g. changing nominal to numeric) and new or “derived” attributes are defined.

4. Data Mining:

Here the data is subjected to one or several data mining methods such as regression, classification, or clustering. The data mining step of KDD usually requires repeated, iterative application of particular data mining methods. Different data mining techniques or models can be used depending on the expected outcome.

5. Evaluation:

This step covers the documentation and interpretation of the outcomes of the previous steps. Actions during this stage might include returning to a previous step in the KDD process to refine the acquired knowledge, or converting the knowledge into a form that is clear to the user. In this stage the extracted data patterns are visualized for further review.

6. Knowledge representation or Visualization:

Knowledge representation is the technique of using visualization tools to present data mining results, for example by generating reports, tables, discriminant rules, classification rules, characterization rules, etc.

Note:

KDD is an iterative process where evaluation measures can be enhanced, mining can be refined,

new data can be integrated and transformed in order to get different and more appropriate

results.

Preprocessing of databases consists of Data cleaning and Data Integration.
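
The stages above can be sketched end to end on a toy data set. This is only an illustrative sketch: the customer records, the outlier rule (spend above five times the median) and the "pattern" (mean spend per segment) are assumptions chosen to keep the example short, not part of the KDD definition.

```python
from statistics import mean, median

# Hypothetical raw records: (customer, age, segment, monthly_spend)
raw = [
    ("c1", 25, "student", 120.0),
    ("c2", 41, "salaried", 480.0),
    ("c3", None, "salaried", 510.0),   # missing age
    ("c4", 38, "salaried", 9999.0),    # implausible spend (outlier)
    ("c5", 23, "student", 150.0),
    ("c6", 45, "salaried", 530.0),
]

# 1. Selection: keep only the attributes relevant to the question.
selected = [(age, segment, spend) for _, age, segment, spend in raw]

# 2. Pre-processing: drop records with missing fields, then remove outliers
#    (assumed rule: spend more than 5x the median).
complete = [r for r in selected if None not in r]
cutoff = 5 * median(r[2] for r in complete)
clean = [r for r in complete if r[2] <= cutoff]

# 3. Transformation: encode the nominal attribute as a number.
codes = {"student": 0, "salaried": 1}
transformed = [(age, codes[segment], spend) for age, segment, spend in clean]

# 4. Data mining: a deliberately simple pattern, the mean spend per segment.
pattern = {
    name: mean(s for _, c, s in transformed if c == code)
    for name, code in codes.items()
}

# 5./6. Evaluation and presentation: report the pattern to the user.
for name, avg in pattern.items():
    print(f"{name}: average monthly spend {avg:.1f}")
```

In practice each stage would be revisited several times, as the note on iteration points out.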

KDD vs Data Mining

1. KDD is a multi-step process that converts data into useful information.

Data mining is one of the steps of Knowledge Discovery in Databases (KDD).

2. KDD refers to the overall process of discovering useful knowledge from data. It involves

the evaluation and possibly interpretation of the patterns to make the decision of what

qualifies as knowledge. It also includes the choice of encoding schemes, preprocessing,

sampling, and projections of the data prior to the data mining step.

Data mining refers to the application of algorithms for extracting patterns from data

without the additional steps of the KDD process.

DBMS vs Data Mining

DBMS

1) A DBMS (Database Management System) is a complete system for managing digital databases. It supports storing database content, creating and maintaining data, searching, and other functions.

2) A DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to the management (that is, the organization, storage and retrieval) of all databases installed in a system (that is, on a hard drive or network).

3) DBMS is a full-fledged system for housing and managing a set of digital databases.

4) Usually, the data used as input for the process of Data Mining are stored in databases.

5) Popular commercial database management systems include Oracle, DB2 and Microsoft Access.

6) All these products provide a means for allocating different levels of privileges for different

users, making it possible for a DBMS to be controlled by a central administrator or simply

be allocated to several different people.

7) SQL is the query language used in relational database systems. A DBMS also provides backup and other facilities.

Data Mining

1) Data Mining is a field in computer science, which deals with the extraction of information

previously unknown and interesting from raw data.

2) Data mining is used by people with a statistical inclination. They use statistical models to search for patterns that are hidden in the data.

3) Data miners find useful relations between data elements, which is good for business. For

example, it is used for different applications such as fraud detection, social network

analysis and marketing.

4) Data mining typically addresses classification, regression, clustering and association.

5) A data mining system can relate to an existing database in three ways: loosely coupled, tightly coupled, or not using the database at all.

6) A majority of DM systems do not use any DBMS and have their own memory and storage

management. They treat the database simply as a repository from which data is expected

to be downloaded into their own memory structures, before the data mining algorithm

starts.

7) The second approach is to have a loosely coupled DBMS. In this case, the DBMS is used only for storage and retrieval of data. The application uses a SQL SELECT statement to retrieve the set of records of interest from the database. This approach does not exploit the full querying capability provided by the DBMS.

8) In the tightly coupled approach, portions of the application programs are selectively pushed to the database system to perform the necessary computation. Data is stored in the database and all processing is done at the database end. Here the data mining application goes where the data naturally resides.
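
The difference between the loosely and tightly coupled approaches can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch: the `sales` table and its contents are hypothetical, and a simple average stands in for a real mining computation.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0), ("south", 150.0)],
)

# Loosely coupled: the DBMS only stores and returns records;
# the aggregation runs in the application's own memory.
totals, counts = defaultdict(float), defaultdict(int)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] += amount
    counts[region] += 1
loose = {r: totals[r] / counts[r] for r in totals}

# Tightly coupled: the computation is pushed to the database,
# so only the summarized result leaves the DBMS.
tight = dict(conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region"))

print(loose)
print(tight)
```

Both variants produce the same averages; what changes is where the work is done and how much data has to leave the database.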

In summary, data mining is a technique or concept in computer science that deals with extracting useful and previously unknown information from raw data. Most of the time, this raw data is stored in very large databases. Data miners can therefore use the existing functionality of a DBMS to handle, manage and even preprocess raw data before and during the data mining process. A DBMS alone, however, cannot be used to analyze data.

Difference between Data Mining and Database

Database

1. A database is an organized collection of data.

2. Most of the time, raw data is stored in very large databases.

3. A database may contain different levels of abstraction in its architecture. Typically, three levels make up the database architecture: external, conceptual and internal.

Data mining

1. Data mining is the analysis of data from different perspectives to discover useful knowledge.

2. Data mining deals with extracting useful and previously unknown information from raw data.

3. The data mining process relies on the data compiled in the data warehousing phase in order to detect meaningful patterns.

Related Areas of Data Mining

Data mining research has drawn on a number of other fields. Let us review the relations of data

mining with some of the important areas.

1. Statistics

• Statistics forms the core of data mining and covers the entire process of data analysis.

• Statistics by itself is only about quantifying data, but it provides the tools necessary for data mining.

• Statistics helps identify patterns and distinguish random noise from significant findings, and it provides a theory for estimating the probabilities of predictions.

• Both data mining and statistics, as data analysis techniques, thereby help in better decision-making.

2. Machine Learning

• Data mining is a cross-disciplinary field that focuses on discovering properties of data sets.

• There are different approaches to discovering properties of data sets. Machine learning is

one of them.

• Machine learning is a sub-field of data science that focuses on designing algorithms that

can learn from and make predictions on the data.

• Machine learning is inductive learning, in which the system infers knowledge by observing its environment. It includes supervised and unsupervised learning methods.

• Supervised learning is learning from examples.

• Unsupervised methods actually start off from unlabeled data sets, so, in a way, they are

directly related to finding out unknown properties in them (e.g. clusters or rules).

3. Supervised Learning

• Supervised learning is a data mining task of inferring a function from labeled training data.

• The training data consist of a set of training examples.

• In supervised learning, each example is a pair consisting of an input object and the desired

output value.

• A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances.

• Examples include support vector machines and decision trees.
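
As a rough illustration of inferring a function from labeled pairs, the sketch below learns a one-level decision tree (a "decision stump") on a single numeric input. The stump is a deliberately simplified stand-in for a full decision tree, and the training data (hours studied, pass or fail) is hypothetical.

```python
def train_stump(examples):
    """Learn a one-level decision tree: predict 1 if x >= threshold, else 0.

    examples is a list of (input, label) pairs with labels 0 or 1.
    """
    xs = sorted({x for x, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct inputs.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_threshold, best_errors = None, len(examples) + 1
    for t in candidates:
        # Count training examples the candidate split would misclassify.
        errors = sum((x >= t) != bool(y) for x, y in examples)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def predict(threshold, x):
    """Apply the inferred function to a new, unseen input."""
    return int(x >= threshold)


# Hypothetical training set: (hours studied, passed the exam?)
training = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]
threshold = train_stump(training)
print(threshold)               # learned split point
print(predict(threshold, 8))   # class label for an unseen example
```

The learned threshold plays the role of the inferred function: once training is done, new examples are classified without consulting the training set again.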

4. Unsupervised Learning

• In data mining, and in data science more broadly, an unsupervised learning task tries to find hidden structure in unlabeled data.

• Unsupervised learning is learning from observation and discovery.

• There is no training set or prior knowledge of the classes.

• All clustering algorithms are unsupervised learning algorithms. Examples: K-means clustering, hierarchical clustering, etc.
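
A minimal sketch of K-means on one-dimensional data shows how structure can be found without any class labels. The points, the starting centroids and the fixed iteration count are assumptions made for illustration; practical implementations choose initial centroids more carefully and stop on convergence.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge purely from the distances between the points.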

5. Mathematical Programming

• Most of the major data mining tasks can be equivalently formulated as problems in

mathematical programming for which efficient algorithms are available.

• It provides new insight into the problems of data mining.

• For example, SVM is a Data Mining technique which uses mathematical

programming.

Data Mining Techniques

Two fundamental goals of data mining are: prediction and description.

• Prediction makes use of existing variables in the database in order to predict

unknown or future values of interest.

• Description focuses on finding patterns describing the data and the subsequent

presentation for user interpretation.

DM techniques can also be classified as:

• User-guided or verification driven data mining (verification model) :

In this process, the user makes a hypothesis and tests the hypothesis on the data

to verify its validity.

Ex: In a supermarket with a limited budget for a mailing campaign to launch a new product, it is important to identify the section of the population most likely to buy the new product. The user formulates a hypothesis to identify potential customers and their common characteristics.

• Discovery driven or automatic discovery of rules (Discovery model) :

The discovery model is the system which automatically discovers important

information hidden in the data.

The data is sifted in search of frequently occurring patterns, trends and

generalizations about the data without intervention or guidance from the user.

An example of such a model is the mining of a supermarket database.

The typical discovery driven tasks are: Discovery of association rules, Discovery of

classification rules, Clustering, Discovery of frequent episodes and Deviation

detection.

1. Classification:

• Classification is used to retrieve important and relevant information about data and metadata.

• This data mining method helps to classify data into different classes.

• Classification consists of predicting a certain outcome based on a given input. In order to

predict the outcome, the algorithm processes a training set containing a set of attributes

and the respective outcome, usually called goal or prediction attribute. The prediction

accuracy defines how “good” the algorithm is.

• Classification technique is a supervised learning technique.

• For example, in a medical database the training set would have relevant patient information

recorded previously, where the prediction attribute is whether or not the patient had a heart

problem.
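
Prediction accuracy can be measured by classifying records whose outcome is already known and counting the correct answers. The sketch below uses a 1-nearest-neighbour classifier, chosen only because it is short; the patient attributes (age, resting heart rate) and all values are hypothetical.

```python
# Hypothetical patient records: ((age, resting heart rate), had_heart_problem)
training = [
    ((35, 62), False), ((42, 70), False), ((50, 68), False),
    ((61, 88), True), ((67, 92), True), ((58, 85), True),
]
# Held-out records with known outcomes, used only to measure accuracy.
test = [((40, 66), False), ((64, 90), True), ((55, 80), True), ((48, 86), False)]


def classify(record):
    """Predict the label of the nearest training record (1-nearest neighbour)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda example: distance(example[0], record))
    return label


correct = sum(classify(attrs) == label for attrs, label in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

The held-out records are kept separate from the training set so that the accuracy reflects performance on unseen data.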

2. Clustering:

• Clustering analysis is a data mining technique for identifying data items that are similar to each other.

• This process helps to understand the differences and similarities between the data.

• Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. The objectives of clustering are:

• To uncover natural groupings

• To initiate hypotheses about the data

• To find a consistent and valid organization of the data

3. Regression:

• Regression analysis is the data mining method of identifying and analyzing the relationship

between variables.

• It is used to estimate the likely value of a specific variable, given the values of other variables.
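
The simplest case is a straight-line relationship between two variables, fitted by ordinary least squares. The following sketch assumes hypothetical data (advertising spend against units sold) and uses the usual closed-form slope and intercept.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: advertising spend vs. units sold.
spend = [1, 2, 3, 4, 5]
sold = [3, 5, 7, 9, 11]
slope, intercept = fit_line(spend, sold)
print(slope, intercept)        # fitted relationship
print(slope * 6 + intercept)   # estimated value for an unseen input
```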

4. Association Rules:

• This data mining technique helps to find associations between two or more items. It discovers hidden patterns in the data set.

• Association rules are if/then statements that help uncover relationships between

seemingly unrelated data in a relational database or other information repository.

• An example of an association rule would be "If a customer buys a dozen eggs, he is 80%

likely to also purchase milk."

• An association rule has two parts, an antecedent (if) and a consequent (then). An

antecedent is an item found in the data. A consequent is an item that is found in

combination with the antecedent.
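
The strength of a rule such as "eggs, then milk" is usually measured by its support (how often antecedent and consequent occur together) and its confidence (how often the consequent occurs when the antecedent does). The sketch below computes both on a hypothetical set of transactions chosen so that the confidence comes out at 80%.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "butter"},
    {"eggs", "bread"},
    {"milk", "bread"},
    {"eggs", "milk", "jam"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"eggs", "milk"})
rule_confidence = confidence({"eggs"}, {"milk"})
print(f"eggs -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is normally reported only when both measures exceed thresholds chosen by the user.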

5. Frequent Episodes

• Frequent episodes are sequences of events that occur frequently and close to each other; they are extracted from time sequences.

• Given a set R of event types, an event is a pair (A, t), where A ∈ R is an event type and t is an integer giving the occurrence time of the event. An event sequence s on R is a triple (Ts, Tc, S), where Ts < Tc are integers: Ts is the starting time and Tc is the ending time.

• S = {(A1, t1), (A2, t2), …, (An, tn)} is the ordered sequence of events.

• Episodes can be of three types: serial episodes, parallel episodes, and episodes that are neither serial nor parallel.

• Applications include telecommunications and share market analysis. Frequent episodes are mainly used for temporal data.
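
One way to decide whether a serial episode is frequent is to slide a fixed-width time window over the event sequence and count the windows in which the episode occurs. The sketch below is a simplification: it only considers windows lying fully inside [Ts, Tc), and the event data, window width and function names are assumptions made for illustration.

```python
def occurs_serially(window_events, episode):
    """True if the event types in episode occur in this order in the window."""
    position = 0
    for event_type, _ in window_events:   # events are sorted by time
        if position < len(episode) and event_type == episode[position]:
            position += 1
    return position == len(episode)


def episode_frequency(events, ts, tc, episode, width):
    """Fraction of windows [t, t + width) inside [ts, tc) containing the episode."""
    windows = range(ts, tc - width + 1)
    hits = 0
    for t in windows:
        inside = [(a, time) for a, time in events if t <= time < t + width]
        hits += occurs_serially(inside, episode)
    return hits / len(windows)


# Hypothetical event sequence (Ts, Tc, S) with Ts = 0 and Tc = 10.
S = [("A", 1), ("B", 2), ("C", 4), ("A", 5), ("B", 7), ("A", 8), ("B", 9)]
frequency = episode_frequency(S, 0, 10, episode=["A", "B"], width=3)
print(frequency)
```

An episode would be called frequent when this fraction reaches a user-chosen threshold.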

6. Deviation Detection

• Deviation detection is to identify outlying points in a particular data set, and explain

whether they are due to noise or other impurities being present in the data or due to trivial

reasons.

• Deviations can be obtained by calculating the values of measures on the current data and comparing them with previous data as well as with normative data.

• Deviation detection can be applied in forecasting, fraud detection, customer retention, etc.
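
A simple way to compare current data with previous data is to flag values that lie far from the historical mean, measured in standard deviations. The three-standard-deviation threshold and the transaction amounts below are assumptions for illustration; real systems use measures suited to the domain.

```python
from statistics import mean, stdev


def deviations(history, current, threshold=3.0):
    """Return the current values lying more than `threshold` standard
    deviations away from the mean of the historical values."""
    mu, sigma = mean(history), stdev(history)
    return [x for x in current if abs(x - mu) > threshold * sigma]


# Hypothetical daily transaction amounts for one credit card.
history = [52, 48, 50, 51, 49, 50, 47, 53]
current = [51, 49, 120, 50]
flagged = deviations(history, current)
print(flagged)
```

Whether a flagged value is noise or a genuine event such as fraud still has to be explained by further analysis.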

7. Rough Set Techniques

• Rough set theory (RST) is concerned with the classification and analysis of imprecise, uncertain or incomplete information and knowledge, and is considered one of the first non-statistical approaches to data analysis.

• The fundamental concept behind RST is the approximation of a set by a lower and an upper approximation. A set defined through its lower and upper approximations is known as a rough set.

• Rough set theory has found many interesting applications, especially in the areas of

machine learning, knowledge acquisition, decision analysis, knowledge discovery from

databases, expert systems, inductive reasoning and pattern recognition.
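
The lower and upper approximations can be computed directly once objects with identical attribute values are grouped together. In the sketch below the patients, their attributes and the target set are hypothetical; the lower approximation collects the groups lying entirely inside the target set, and the upper approximation collects every group that overlaps it.

```python
from collections import defaultdict


def approximations(objects, target):
    """Lower and upper approximation of `target` under indiscernibility.

    objects maps an object name to its tuple of attribute values; objects
    with identical tuples are indiscernible and form one equivalence class.
    """
    classes = defaultdict(set)
    for name, attributes in objects.items():
        classes[attributes].add(name)
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:      # class lies entirely inside the target
            lower |= members
        if members & target:       # class overlaps the target
            upper |= members
    return lower, upper


# Hypothetical patients described by (headache, temperature).
patients = {
    "p1": ("yes", "high"), "p2": ("yes", "high"),
    "p3": ("no", "high"), "p4": ("no", "normal"),
    "p5": ("no", "high"),
}
flu = {"p1", "p2", "p3"}           # patients known to have flu
lower, upper = approximations(patients, flu)
print(sorted(lower))
print(sorted(upper))
```

Here p3 and p5 have the same attribute values but different outcomes, so they fall in the upper approximation only: the available attributes cannot decide whether such a patient has flu.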

8. Neural Networks

• Neural networks were modeled after the cognitive processes of the brain. They are capable

of predicting new observations from existing observations.

• A neural network consists of interconnected processing elements, also called units, nodes, or neurons. The neurons within the network work together, in parallel, to produce an output function (the network is robust and fault tolerant).

• Neural networks are good at identifying patterns or trends in data, so they are well suited to prediction and forecasting needs.
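
The smallest possible network is a single neuron. The sketch below trains one neuron with the perceptron learning rule on the logical AND function; the step activation, zero initial weights and epoch count are assumptions chosen to keep the example small, and real networks combine many such units in layers.

```python
def predict(weights, bias, inputs):
    """One neuron: weighted sum of the inputs followed by a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


def train(samples, epochs=20):
    """Perceptron learning rule: adjust the weights after every wrong output."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias


# Existing observations: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(samples)
outputs = [predict(weights, bias, x) for x, _ in samples]
print(outputs)
```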

9. Support Vector Machines (SVM)

• SVMs are learning machines that can perform binary classification and regression estimation tasks. SVMs are also recognized as efficient tools for data mining. SVMs are popular for two important reasons:

• An SVM minimizes the expected error rather than the classification error.

• It employs the duality theory of mathematical programming to obtain a dual problem that admits efficient computational methods.
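
For reference, the standard soft-margin formulation for binary classification shows what the dual problem looks like. The notation is an assumption introduced here, not taken from these notes: training pairs (x_i, y_i) with labels y_i in {-1, +1}, weight vector w, bias b, slack variables ξ_i, penalty parameter C and dual variables α_i.

```latex
% Primal problem: maximize the margin while penalizing violations
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

% Dual problem obtained through Lagrangian duality
\max_{\alpha}\quad \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,x_i^{\top}x_j
\qquad\text{subject to}\qquad
0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i\,y_i = 0
```

The dual depends on the training inputs only through the inner products of pairs of inputs, which is one reason it can be solved efficiently.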

10. Genetic Algorithms

• A genetic algorithm (GA) is a computational model suitable for solving complex optimization problems. It operates on a set of individual elements (the population), and a set of biologically inspired operators can change these individuals.

• The algorithm operates through a simple cycle: creation of a population of strings, evaluation of each string, selection of the best strings, and genetic manipulation to create a new population of strings.

• The advantage of GAs is that they can search large and complex spaces and often find good solutions quickly.
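
The cycle described above can be sketched on a toy problem: evolving bit strings so that they contain as many 1s as possible. Everything specific here (the fitness function, population size, one-point crossover, 10% mutation rate, keeping the best string unchanged) is an illustrative assumption rather than a prescribed design.

```python
import random


def evolve(length=20, population_size=30, generations=40, seed=0):
    """Tiny genetic algorithm maximising the number of 1s in a bit string."""
    rng = random.Random(seed)
    fitness = sum  # evaluation: count the 1 bits

    # Creation of a population of strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    initial_best = max(map(fitness, population))

    for _ in range(generations):
        # Evaluation and selection: keep the better half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        # Genetic manipulation: one-point crossover plus occasional mutation.
        children = [parents[0][:]]                 # keep the best string as is
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.1:
                child[rng.randrange(length)] ^= 1  # flip one bit
            children.append(child)
        population = children

    return initial_best, max(map(fitness, population))


initial_best, final_best = evolve()
print(initial_best, final_best)
```

Because the best string is always carried over, the best fitness can never decrease from one generation to the next, although reaching the true optimum is not guaranteed.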

OTHER MINING PROBLEMS

Data for data mining need not always be enterprise-related data residing in a relational database. Data sources are very diverse and appear in varied forms. Data mining remains an important tool irrespective of the form or source of the data. The data mining problems for different types of data are as follows:

1. Sequence mining:

• Sequence mining is concerned with mining sequence data.

• A sequence is a time-ordered list of objects, in which each object consists of an itemset,

with an itemset consisting of all items that appear (or were bought) together in a

transaction (or session).

• The purpose of sequence mining is to find frequent sequences that exceed a user-specified

support threshold.

• Mining sequential patterns has found a host of potential application domains, including

WWW, retailing, telecommunication etc.
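
The support of a candidate sequence is the fraction of data sequences that contain it, where containment means that each itemset of the candidate appears, in order, inside some itemset of the data sequence. The sketch below checks one candidate against a hypothetical set of customer histories; a full sequence mining algorithm would also generate the candidates.

```python
def contains(sequence, pattern):
    """True if pattern's itemsets appear, in order, inside sequence's itemsets."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


# Hypothetical customer histories: each is a time-ordered list of itemsets.
sequences = [
    [{"phone"}, {"case", "charger"}, {"headphones"}],
    [{"phone", "case"}, {"headphones"}],
    [{"laptop"}, {"mouse"}],
    [{"phone"}, {"charger"}],
]
pattern = [{"phone"}, {"headphones"}]
min_support = 0.5                      # user-specified threshold
value = support(sequences, pattern)
print(value, value >= min_support)
```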

2. Web mining:

• The world wide web has become a very popular medium of publishing. Though the web is

rich with information, gathering and making sense of this data is difficult because

publication on the web is largely unorganized.

• Web mining is the use of data mining techniques to extract implicit, previously unknown information from the massive collection of web documents and services.

• Web mining can be broken down into the following subtasks: resource finding, information selection and pre-processing, generalization (discovering general patterns at individual web sites and across multiple sites) and, finally, analysis (validation and/or interpretation of mined patterns).

3. Spatial data mining

• Spatial data mining is a special kind of data mining.

• Spatial data (location or geo-referenced data) are the data related to objects that occupy

space.

• Spatial data mining refers to the extraction of implicit knowledge, spatial relations, or other

patterns not explicitly stored in spatial databases.

• For example, “the land value of the cluster of residential areas to the east of ‘Cyber-tower’ is high” is a piece of spatial knowledge. Such information could be of value to investors and home buyers. Spatial data mining also applies to other kinds of spatial data, such as satellite images and photographs.

4. Text mining:

• The term text mining, or KDT (Knowledge Discovery in Text), was first proposed by Feldman and Dagan in 1996.

• The term text mining is used to cover many applications such as text clustering, exploratory data analysis, finding patterns in text databases, information extraction and association discovery.

Issues And Challenges In Data Mining

Data mining systems depend on databases to supply the raw input, and this raises problems because databases tend to be dynamic, incomplete, noisy and large. Other problems can arise from the irrelevance of the stored information. The difficulties in databases can be categorized as:

1) Limited information

A database is designed for a specific purpose and not for data mining. Some attributes

which are essential for knowledge discovery might not be present in the data. Hence

discovering knowledge about a given domain is difficult.

2) Noise and missing data

Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.

3) User interaction and prior knowledge

There needs to be more human-computer interaction and less emphasis on total automation, with support for both novice and expert users. Developing a KDD tool that is both interactive and iterative is therefore a must.

4) Uncertainty

This refers to the severity of error and the degree of noise in the data. Data precision is an

important consideration in a discovery system.

5) Size, updates and irrelevant fields

Databases tend to be large and dynamic, in that their contents change as information is added, modified or removed. The problem, from the perspective of data mining, is how to ensure that the rules are up to date and consistent with the most current information.

Data Mining Application Areas

The data mining applications can be naturally divided into two broad categories:

1. Business And E-Commerce Data

This is a major source category of data for data mining applications. Back-office, front-

office and network applications produce large amounts of data about business processes.

• Business Transactions

Modern businesses are consolidating and now deal with millions of customers and billions of transactions. Business enterprises need this information in order to function effectively.

• Electronic Commerce

E-commerce produces large data sets in which the analysis of marketing patterns and risk patterns is critical to meeting the demands of online transactions.

2. Scientific, Engineering And Health-Care Data

Scientific data and metadata tend to be more complex. Scientists and engineers are making increasing use of simulation and of systems with application domain knowledge.

• Genomic data

Genomic sequencing and mapping efforts have produced a number of databases which are

accessible on the web. Finding relationships among these data sources is a fundamental

challenge for data mining.

• Sensor data

Remote sensing data is another source of voluminous data. Remote-sensing satellites and

a variety of other sensors produce large amounts of geo-referenced data.

• Simulation data

Simulation is now accepted as an important mode of science, supplementing theory and experiment. Simulation also produces large data sets.

• Health Care data

Hospitals, health care organizations, insurance companies, and the concerned government agencies accumulate large collections of data about patients and health-related details.

• Web data

Data on the web, such as audio, video and numerical data, is growing not only in volume but also in complexity.

• Multimedia documents

There is a large amount of multimedia data on the web. Extracting meaningful information from archives of multimedia data is becoming harder as the volume grows.

• Data web

The web is a collection of different kinds of data. HTML and XML are the emerging languages for working with data in networked environments. As the infrastructure grows, data mining is expected to be a critical enabling technology for the emerging data web.

Other Application Areas Of Data Mining

1) Risk Analysis

Given a set of current customers and an assessment of their risk worthiness, develop descriptions for the various risk classes. Use these descriptions to classify a new customer into one of the risk categories.

2) Targeted marketing

Given a database of potential customers and how they have responded to a solicitation, develop a model of the customers most likely to respond positively, and use the model for more focussed solicitation of new customers.

3) Customer Retention

Given a database of past customers and their behaviour prior to attrition, develop a model of the customers most likely to leave. Use the model to determine the best course of action for these customers.

4) Portfolio management

Given a particular financial asset, predict the return on investment to determine whether to include the asset in a portfolio.

5) Brand loyalty

Given a customer and the product he/she uses, predict whether the customer will switch brands.

6) Banking

The application areas in banking are:

• Detecting patterns of fraudulent credit card use

• Identifying loyal customers

• Predicting customers likely to change their credit card affiliation

• Determining credit card spending by customer groups

Data Mining Applications

• Communications - Data mining techniques are used in the communications sector to predict customer behavior and offer highly targeted and relevant campaigns.

• Insurance - Data mining helps insurance companies price their products profitably and promote new offers to their new or existing customers.

• Education - Data mining helps educators access student data, predict achievement levels and find students or groups of students who need extra attention, for example students who are weak in mathematics.

• Manufacturing - With the help of data mining, manufacturers can predict wear and tear of production assets. They can anticipate maintenance, which helps them minimize downtime.

• Banking - Data mining helps the finance sector get a view of market risks and manage regulatory compliance. It helps banks identify probable defaulters when deciding whether to issue credit cards, loans, etc.

• Retail - Data mining techniques help retail malls and grocery stores identify their most sellable items and arrange them in the most eye-catching positions. They help store owners come up with offers that encourage customers to increase their spending.

• Service Providers - Service providers such as mobile phone and utility companies use data mining to predict the reasons why a customer may leave their company. They analyze billing details, customer service interactions and complaints made to the company to assign each customer a probability score, and then offer incentives.

• E-Commerce - E-commerce websites use data mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses data mining techniques to attract more customers to its e-commerce store.

• Super Markets - Data mining allows supermarkets to develop rules that predict whether their shoppers are likely to be expecting a baby.

• Crime Investigation - Data mining helps crime investigation agencies decide where to deploy the police workforce (where is a crime most likely to happen, and when?), whom to search at a border crossing, etc.

• Bioinformatics - Data Mining helps to mine biological data from massive datasets gathered

in biology and medicine.
