

DATA WAREHOUSING AND DATA MINING

ANNA UNIVERSITY CHENNAI

UNIT I

INTRODUCTION


OBJECTIVES

Data mining is a promising and flourishing field. It concerns the extraction of models and patterns from the data stored in the various archives of business organizations. The extracted knowledge can help organizations discover new products, improve their business processes and develop decision support systems. This chapter starts by showing the relationships of data mining with fields like machine learning, statistics and database technology. It also provides an overview of data mining tasks, the components of a data mining algorithm and data mining architectures, giving a holistic view of data mining from the database, statistical, algorithmic and application perspectives.

LEARNING OBJECTIVES

To explore the relationships among data mining, machine learning, statistics and database technology
To study the nature of data and data sources
To study the components of a data mining algorithm
To explore the functionalities of data mining
To explore the Knowledge Discovery in Databases (KDD) cycle
To explore the architecture of a typical data mining system

1.1 NEED FOR DATA MINING

Business organizations use huge amounts of data for their daily activities, so modern organizations hold a large quantity of data and information. It is also easier now to capture, process, store, distribute and transmit digital information. This huge amount of data is growing at a phenomenal rate; it is estimated that the data doubles approximately every 20 months.


But the full potential of this information and data is not fully realized, due to two reasons:

1. Information is scattered across different archive systems, and most organizations have not succeeded in integrating these sources fully.

2. There has been a lack of awareness about software tools that can help them unearth the useful information hidden in the data.

Here comes the need for data mining. With declining hardware and software costs, organizations feel the necessity of organizing data in effective structures and of analyzing it using flexible and scalable procedures.

The motivations for organizations to implement data mining solutions are the following factors:

1. Advancement of Internet technology has made communication among customers and organizations easier. This situation enables data to be effectively shared and distributed, and new data can be added to the existing data.

2. The competition among business organizations is now intense. This forces organizations to discover new products, improve their processes and services, and so forth.

3. Hardware costs, especially storage costs, are rapidly falling, and there has been growth in the availability of robust and flexible algorithms.

Therefore data mining is considered an important field of study, and business organizations now spend more effort on developing better data mining tools.

1.2 NATURE OF DATA MINING

Data mining uses concepts from Statistics, Computer Science, Machine Learning, Artificial Intelligence (AI), Database Technology, and Pattern Recognition. Each field has its own distinct way of solving problems, and data mining is the result of combining ideas from these diverse fields. Data mining's original genesis is in business. Just as mining the earth yields precious resources, it is often believed that mining the data produces hidden information that would otherwise have eluded the attention of management.

Defining the field is a difficult task because of many conflicting definitions. Especially in the data mining area, there are many conflicting views because of its diverse nature. Some view it as an extension of statistics, some view data mining as part of machine learning, and a large community of researchers views data mining as the logical advancement of database technology.

A standard definition of Knowledge Discovery in Databases is given by Fayyad, Piatetsky-Shapiro, and Smyth (1996): "Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data."


Hence the crucial points are valid, novel and understandable patterns. By valid, it is meant that the patterns generally hold in all circumstances. Novel means the discovered pattern is new and has not been encountered before. By understandable, it is meant that the pattern is interpretable.

For example, looking up the address of a customer in a telephone directory is not a data mining task. But making a prediction of the sales of a particular item is the domain of data mining.

The definition that effectively captures the gist of data mining (D. J. Hand) is: "Data mining (consists in) the discovery of interesting, unexpected, or valuable structures in large data sets." This is shown as a diagram in Figure 1.1.

Figure 1.1 Nature of Data Mining Process

What is the difference between KDD and Data mining?

The term KDD means Knowledge Discovery in Databases. Many authors consider both terms synonymous, but it is better to remember that KDD is a whole process and data mining is just the algorithmic component of the entire KDD process.

All these definitions imply that data mining assumes there are treasures (or nuggets) under the pile of data, and the goal of data mining is to discover these treasures using its specific tools. This is not a new idea; statistics has been using exploratory data analysis and multivariate exploratory analysis for many years. But data mining is different, because the manual methods of statistics and statisticians for producing summaries and generating reports are applicable or feasible only for small data sets, while organizations have gigabytes or even terabytes of data. For such a huge pile of data, manual methods fail; hence automatic methods like data mining are of great use.

Data mining methods can be both descriptive and predictive. Descriptive methods discover interesting patterns or relationships that are inherent in the data, and predictive methods predict the behavior of the models. Some of the areas that use data mining extensively to understand the behavior of data are banking, finance, medicine, security, and telecommunications. Data mining gives more stress to scalability with the number of features and instances, to algorithms and architectures, and to automation of data handling.


1.3 DATA, INFORMATION, AND KNOWLEDGE

It is better to have a clear idea about the distinctions between data, information and knowledge. The knowledge pyramid is shown in Figure 1.2.

Figure 1.2 The Knowledge Pyramid.

Data
All facts are data. Data can be numbers or text that can be processed by a computer.

Today, organizations are accumulating vast and growing amounts of data. Billions of records and gigabytes of data are not unusual in larger business organizations, and data is stored in different data sources like flat files, databases or data warehouses, and in different storage formats.

The types of data organizations use may be operational or non-operational. Operational data is the data needed to carry out normal business procedures and processes; for example, daily sales data is operational data. Non-operational data is strategic data reserved for taking decisions; for example, macroeconomic data is non-operational data. Strategic data is of great importance to organizations for taking decisions, and hence is normally not updated, modified or deleted at will by the user.

Data can also be metadata, which is data that defines other data. The data present in the data dictionary and the logical database design are examples of metadata.

Information
The processed data is called information. This includes patterns, associations, or relationships among data. For example, the sales data can be analyzed to extract information such as which is the fastest selling product.

Knowledge
Knowledge is condensed information. For example, the historical patterns and future trends obtained from the above sales data can be called knowledge.

(Figure 1.2 shows the pyramid levels: Data - raw facts, Information - processed data, Knowledge - condensed information, Intelligence, Wisdom.)


The selling pattern of a product and even the buying behavior of customers help business organizations immensely. These are called knowledge because they are the essence of the data. Many organizations are data rich but knowledge poor. Unless knowledge is extracted, the data is mostly of no use.

The knowledge encountered in the data mining process is of four types:

1. Shallow knowledge: This type of knowledge can be easily stored and manipulated using query languages. For example, using a query language like SQL one can extract information from a table.

2. Multidimensional knowledge: This is similar to shallow knowledge. The difference is that multidimensional knowledge is associated with multidimensional data and is often queried through OLAP operations on data cubes.

3. Hidden knowledge: This knowledge is present in the form of regular patterns or regularity in the data. It cannot be extracted using query languages like SQL; only data mining algorithms can extract this type of knowledge.

4. Deep knowledge: This type of knowledge lies deep within the data and is a sort of domain knowledge; direction should be given to extract this kind of knowledge. Sometimes even data mining algorithms are not successful in extracting it.

Once the knowledge is extracted, it should be represented in some form so that users can understand it. The types of knowledge structures one may encounter in data mining are:

1. Decision table.

A decision table is a two-dimensional table structure which gives the knowledge in the form of values. A sample decision table is shown in Table 1.1.

Table 1.1: Sample Decision Table

Based on the decision table, one can infer some knowledge of the attributes.

2. Decision Tree

A decision tree is a flowchart-like structure that has nodes and edges. The nodes are either the root or internal nodes. The root is a special node, and internal nodes represent conditions that are used to test attribute values. The terminal nodes represent the classes.


3. Rules

Rules represent knowledge in the form of IF-THEN rules. For example, a rule can be extracted from the above table and represented as

IF Number of hours worked >= 40 THEN Result = Pass

A special type of rule can also involve exceptions. Such rules are of the form IF condition THEN result EXCEPTION condition. This structure can capture exceptions as well (a small illustrative sketch of such a rule appears after this list of knowledge structures).

4. Instance Space

Points can be represented in the instance space and grouped to convey knowledge. The knowledge can then be expressed in the form of dendrograms.
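As mentioned above under Rules, a minimal sketch (plain Python, with a hypothetical exception condition not taken from the text) of how an IF-THEN rule with an exception could be represented and applied:

def apply_rule(hours_worked, attendance_short=False):
    # IF Number of hours worked >= 40 THEN Result = Pass
    # EXCEPTION attendance is short (hypothetical exception condition)
    if hours_worked >= 40:
        if attendance_short:          # the exception overrides the rule
            return "Fail"
        return "Pass"
    return "Fail"

print(apply_rule(45))                          # Pass
print(apply_rule(45, attendance_short=True))   # Fail, because the exception applies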

Intelligence and Wisdom

Applied knowledge is called intelligence. For example, the knowledge extracted can be put to use by organizations in the form of intelligent systems or knowledge-based systems. Finally, wisdom represents the ultimate maturity of mind; it is still a long way before computer programs can process wisdom.

1.4 DATA MINING IN RELATION TO OTHER FIELDS

Data mining methodology draws upon a collection of tools from mainly two areas - Machine Learning (ML) and Statistics. These tools include traditional statistical methods of multivariate analysis, such as classification, clustering, contingency table analysis, principal components analysis, correspondence analysis, multidimensional scaling, factor analysis, and latent structure analysis. Data mining also uses tools beyond statistics, like tree building, support vector machines, link analysis, genetic algorithms, market-basket analysis, and neural network analysis.

1.4.1 Data Mining and Machine learning

Machine learning is an important branch of AI and is concerned with finding relations and regularities present in the data. Rosenblatt introduced the first machine learning model, called the perceptron. Decision trees were also proposed around that time.

Machine learning is the automation of a learning process. This broad field includes not only learning from examples but also other areas like reinforcement learning, learning with a teacher, etc. A machine learning algorithm takes a data set and its accompanying information as input and returns a concept. The concept represents the output of learning, which is the generalization of the data. This generalization can also be used to test new cases.


What does learning mean? Learning, like adaptation, occurs as the result of the interaction of the program with its environment. It can be compared with the interaction between a teacher and a student. This is the domain of inductive learning.

Learning takes place in two stages. During the first stage, the teacher communicates the information that the student is supposed to master, and the student receives and understands it. During this stage the teacher has no knowledge of whether the information has been grasped by the student. This leads to the second stage of learning: the teacher asks the student a set of questions to find out how much information has been grasped. Based on these questions, the student is tested and the teacher informs the student of his assessment. This kind of learning is typically called supervised learning.

This process identifies the classes to which an instance belongs and is called classification learning. The model learns from examples, where a teacher helps the system construct a model by defining classes and supplying examples of each class. The model learns the data and generates concepts, and the concepts are used to predict the class of previously unseen objects. This is similar to discriminant analysis in statistics.

The second kind of learning is by self-instruction, the commonest kind of learning process. Here the student comes into contact with the environment and interacts with it through a punishment/reward mechanism that enables him to learn correctly. This process of self-instruction is based on the concept of learning from mistakes: the more mistakes the student makes and learns from, the better his learning will be. This type of learning is called unsupervised learning. Here the program is supplied with objects but no classes are defined; the algorithm itself observes the examples and recognizes patterns based on the principles of grouping. This is similar to cluster analysis in statistics. The quality of the model produced by inductive learning methods is such that the model can be used to predict the outcome of unseen test cases.

By 1990’s there has been considerable overlap among Machine learning and statisticscommunity. The term KDD was coined to describe the whole process of extractinginformation from data. Data mining is considered to be just one component of KDD,which mainly concerns about learning algorithms.

What is the difference between Machine learning and KDD?

Some subtle differences exist between KDD and Machine learning.

They are

1. KDD is concerned with finding understandable knowledge in a database, while machine learning is concerned with improving the performance of an agent.

2. KDD is concerned with very large, real-world databases, while machine learning typically involves smaller data sets.


Machine learning is a broader field compared to KDD. Learning from examples is just one methodology, along with other methods like reinforcement learning, learning with a teacher, etc.

1.4.2 Data Mining and Database Technology

Data mining derives its strength from the development of database technology. Typically business organizations maintain many databases, but primarily they use two types:

1. Operational databases
2. Strategic databases

Operational databases are used to conduct the day-to-day operations of the organization. This includes information about customers, transactions, products, etc. Strategic databases are databases used for taking decisions. Larger organizations have many operational databases.

For decision making one may have to combine different operational databases. Frequently organizations implement data warehouse technology for this strategic decision making. A data warehouse provides a centralized database that supplies data for data mining in a suitable form. A data warehouse is not a requirement for data mining, but using one solves many problems associated with the data, like integrity and errors. Data mining extracts data from the data warehouse into a database or data mart.

Operational databases are transaction-oriented. A typical customer database is an operational database. Well-defined and repetitive queries like "What is the address of Mr. X?" or "In which city does Ms. Y live?" are the common questions answered by a traditional operational database. An operational database has to support a large number of such queries and updates on the contents of the data. This type of database usage is called Online Transaction Processing (OLTP).

But decision support requires different types of queries. Aggregation is one such query. Queries like "Find the sales of this product in the Tamil Nadu region year wise and make a prediction for the next quarter" are the domain of data mining. This kind of query is called Online Analytical Processing (OLAP). Such a query can be used to summarize the data of the database and to provide data in a suitable form for data mining to make a prediction. OLAP also gives a graphical presentation of the empirical relationship between the variables in the form of a multidimensional data cube to facilitate decision making.
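As a minimal sketch (assuming Python with the pandas library; the table and column names are hypothetical), an OLAP-style aggregation that summarizes sales year wise for one region could look like this:

import pandas as pd

# hypothetical sales records
sales = pd.DataFrame({
    "region": ["Tamil Nadu", "Tamil Nadu", "Kerala", "Tamil Nadu"],
    "year":   [2006, 2007, 2007, 2007],
    "amount": [120000, 150000, 90000, 170000],
})

# roll up the sales of the Tamil Nadu region year wise
summary = (sales[sales["region"] == "Tamil Nadu"]
           .groupby("year")["amount"]
           .sum())
print(summary)    # one aggregated value per year, ready for further analysis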

OLAP and OLTP pose different kinds of requirements on database administrators. OLTP requires the data to be up to date; the queries operate on the data and, if necessary, make modifications in the database. But OLAP is a deductive process and, as an important tool of business intelligence, OLAP does not modify the data but instead tries to answer questions like why certain things are true.

OLAP requires the user to formulate a hypothesis as a query. The query is then validated against the data for confirmation. In other words, the OLAP analyst generates a series of hypotheses and validates them using data warehouse data.

For a small data set with few variables, OLAP produces results quickly. But for data sets involving hundreds or thousands of variables, speed and complexity become crucial problems, and OLAP queries become complex and time consuming. The requirements of OLAP and OLTP are so different that organizations sometimes maintain two different sets of applications.

OLTP and OLAP can complement each other. Before a query is generated, the analyst often needs to explore the financial implications of the discovery of patterns that may be useful for the organization. However, OLAP is not a substitute for data mining; rather, they complement each other. OLAP can help the user in the initial stages of the KDD process, helping data mining focus on the most important data. Similarly, the final data mining results can easily be represented by an OLAP hypercube.

Hence the sequence would be query -> data retrieval -> OLAP -> data mining. A query has the lowest information processing capacity and data mining the highest. Hence there is a trade-off between information capacity and ease of implementation, and business organizations should know exactly what their requirements are.

1.4.3 Data Mining and Statistics

Statistics is a branch of mathematics that has a solid theoretical foundation for statistical learning, but it requires knowledge of statistical procedures and the guidance of a good statistician. Data mining, however, combines the expert's knowledge of the data with advanced analysis techniques, employing computer systems that effectively replace the guidance of the statistician.

Statistical methods are developed in relation to the data being analyzed. Statistical methods are also coherent and rigorous, with strong theoretical foundations. Many statisticians consider that data mining lacks a solid theoretical model and has too many competing models; the contention is that there is always a model that fits the given data.

Secondly, it is often believed that a great amount of data can lead to some non-existent relations. Often these issues are referred to derogatively as "data fishing", "data dredging" or "data snooping". While it is worth considering all these issues, modern data mining algorithms pay great attention to the generalization of results.


Another aspect that separates data mining is that statistics is generally concerned primarily with primary data, that is, data stored in primary memory like RAM. Data mining requires a huge pile of data, and hence focuses more on data in secondary storage. Therefore, while statistics is content with experimental data, data mining deals more with observed data.

But data mining is not a substitute for statistics. Statistics still has a role to play in interpreting the results of data mining algorithms. Hence statistics and data mining will not replace each other but complement each other in data analysis.

1.4.4 Data Mining and Mathematical Programming

With the advent of Support Vector Machines (SVM), the relation between data mining and mathematical programming is well established. It also provides the additional insight that the majority of data mining problems can be formulated as mathematical programming problems for which more efficient solutions can be obtained.

1.5 DATA AND ITS STORAGE STRUCTURES

Some of the application domains where huge data storage is used are listed below.

Digital library: This contains text data as well as document images.
Image archives: Many archives contain large image databases along with numeric and text data.
Health care: This sector uses extensive databases like patient databases, health insurance data, doctors' information, and bioinformatics information.
Scientific domain: This domain has huge collections of experimental data like genomic data and biological data.
WWW: A huge amount of data is distributed on the Internet. These data are heterogeneous in nature.

Data mining algorithms can handle different media or data. However, the algorithms may differ when they handle different types of media or data. This is a great challenge to researchers.

What are the types of data?

Data are of different types, but generally the data encountered in the data mining process is of three types. They are:

1. Record data
2. Graph data
3. Ordered data

1. Record Data

A dataset is a collection of measurements taken from a process. We have a collection of objects in a dataset, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Each row represents an object and can be called an entity, case, or record. The columns of the dataset are called attributes, features or fields. The table is filled with observed data. It is also useful to note the general jargon associated with a dataset. Label is the term used to describe an individual observation. All the input variables whose values are used to make a prediction are called descriptors or predictor variables. The variable that is predicted is called the response variable.

Take, for example, a dataset of medical records (refer to Table 1.2).

Table 1.2 Sample Medical Data Set

Here each record is about a patient, and each column is an attribute of the patient record. The record may have data in different forms like text, numeric or even image data. The measurement scale of the data is generally categorized as nominal, ordinal, interval and ratio. Nominal is a measurement scale of a variable which can assume a limited number of different values; for example, patient location may be (Chennai, Not Chennai), and it is meaningless to ask which of these two is larger. Ordinal is a scale where the variable has ordered values, but the difference does not show the magnitude of the actual difference; for example, the degree of disease may be high, medium or low.

As a degree, high is certainly greater than medium and medium is greater than low, but the difference does not reveal anything about the magnitude. A variable whose values have meaningful intervals is measured on an interval scale, but the interval scale has no natural zero. Finally, ratio is a scale that includes a natural zero, so that ratios of values are meaningful.
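A minimal sketch (assuming Python with the pandas library; the attribute values are hypothetical) of how the measurement scales differ in what operations are meaningful:

import pandas as pd

# ordinal attribute: the order of the categories is meaningful
severity = pd.Categorical(["high", "low", "medium"],
                          categories=["low", "medium", "high"], ordered=True)
# nominal attribute: the values have no order
location = pd.Categorical(["Chennai", "Not Chennai", "Chennai"])
# ratio-scale attribute: has a natural zero, so ratios make sense
age = pd.Series([45, 30, 60])

print(severity.max())          # 'high' - comparisons are valid for ordinal data
print(age.max() / age.min())   # 2.0   - ratios are meaningful on a ratio scale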

Special types of records are:

1. Transaction records
2. Data matrix
3. Sparse data matrix

A transaction record (refer to Table 1.3) is a record that stores transactions on a regular basis. For example, a general store can keep a record of the items purchased by its customers.


Table 1.3 Sample Transaction Data set.

A characteristic of this data is that the record may have many zeroes. In other words, the data is binary in nature and marked by many zeroes. Sometimes the interest may be in analyzing the non-zero components only.
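A minimal sketch (plain Python, with hypothetical items) of transaction records represented as a binary matrix, showing how most entries are zero:

items = ["bread", "milk", "butter", "jam"]
transactions = [
    {"bread", "milk"},
    {"milk", "butter"},
    {"bread", "milk", "jam"},
]
# 1 means the item was purchased in that transaction, 0 otherwise
matrix = [[1 if item in t else 0 for item in items] for t in transactions]
for row in matrix:
    print(row)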

Data Matrix

The data matrix is a variation of the record type that consists of numeric attributes. Standard matrix operations can be applied to these data. The data can be thought of as points or vectors in a multidimensional space, where every attribute is a dimension describing the object.

Sparse data matrix

This is also a special type of data matrix where only the non-zero values are important.

2. Graph data

Graph data involves the relationships among objects. For example, a web page can refer to another web page; this can be modeled as a graph where the nodes are the web pages and a hyperlink is an edge that connects the nodes.

A special type of graph data is one where a node is itself another graph.

3. Ordered data

Ordered data objects involve attributes that have relationships involving order in time or space.

The examples of ordered data are

1. Sequential data are temporal data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data.

2. Sequence data are similar to sequential data but do not have time stamps. These data involve a sequence of words or letters. For example, DNA data is a sequence of four characters - A, T, G, C.

3. Time series data is a special type of sequence data where the data is a series of measurements over time.


4. Spatial data

Spatial data has attributes such as positions or areas. For example, maps are spatial data where the points are related by location.

Data Storage

Once the dataset is assembled, it has to be stored in a structure that is suitable for data mining. The goal of data storage management is to make data available to support processing and applications for data mining algorithms. There are different approaches to organizing and managing data, in storage files and systems ranging from flat files to data warehouses. Each has its own unique way of capturing and storing the data.

Some are listed below

1. Flat files
2. Databases
3. Data warehouses
4. Special databases like object-relational databases, object-oriented databases, transactional databases, and advanced databases like spatial databases, multimedia databases, time-series databases and textual databases
5. The unstructured and semi-structured World Wide Web

Let us review some of these data stores briefly, so that the input for data mining can be conditioned.

1. Flat Files

Flat files are the simplest and most commonly available data source, and the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format.

Simple programs like Notepad or a spreadsheet can be used to create a flat file data source for data mining at the lowest level. Flat files may be simple data files or files in binary format, but the data mining algorithm should be aware of the structure of the flat file for processing.

The flat file is called flat because the same fields are located at the same offset in each record. Minor changes of data in flat files affect the results of data mining algorithms. Hence a flat file is suitable only for storing small data sets and is not desirable if the data set becomes larger.

2. Database System

A database system normally consists of database files and a database management system (DBMS). The database files contain the original data and metadata. The DBMS aims to manage the data and improve operator performance through various tools for database administration, query processing and transaction management.

A relational database consists of sets of tables. The tables have rows and columns; the columns represent the attributes and the rows represent the tuples. A tuple corresponds to either an object or a relationship between objects.

The user can access and manipulate the data in the database using SQL. Data mining algorithms can use SQL to take advantage of the structure present in the data, but data mining algorithms go beyond SQL by performing tasks such as classification, prediction and deviation analysis.

3. Transactional Databases

A transactional database is a collection of transaction records. Each record is a transaction. A transaction may have a time stamp, an identifier and a set of items, which may have links to other tables. Normally transactional databases are created for performing association analysis that indicates the correlations among the items.

4. Data Warehouses

A data warehouse is a decision support database. It provides historical data for data mining algorithms.

A data warehouse is a subject-oriented, centralized data source. It is constructed by integrating multiple heterogeneous data sources like flat files and databases in a standard format. It provides a subject-oriented, integrated, time-variant and non-volatile collection of data for decision support systems and data mining.

By subject oriented we mean that the data is stored around subjects of the organization, like customer and sales. By integrated we mean that the data comes from multiple data sources and is cleaned and preprocessed so that it is reliable for mining. By non-volatile we mean that the data provided by the data warehouse is historical in nature, and normally no updates, modifications or deletions are carried out. Thus the data warehouse is like a container holding all the data necessary for carrying out business intelligence operations.

There are two ways in which a data warehouse can be created. The first is as a centralized archive that collects all the information from all the data sources. The second approach is to integrate different data marts to create the data warehouse; in this approach the data marts are initially not connected, but the integration is carried out slowly so that a centralized data source is created. Thus a data warehouse has centralized data, a metadata structure and a series of specific data marts that are accessible by the users.


Modern data warehouses for the web are called data webhouses. With the advent of the web, the data warehouse becomes a web data warehouse, or simply a data webhouse. The web offers a most heterogeneous and dynamic repository, which includes different kinds of data like raw data, images and video that can be accessed through a browser. Web data has three components: the content of the web, which includes the web pages; the web structure, which includes the relationships between the documents; and the web usage information. The data webhouse has an integrated browser to access the data, and these data can be fed into conventional sources of data for data mining algorithms. Speed is very crucial for the success of webhouses.

Another variation of the data warehouse is the data mart. A data mart is a thematic database, that is, a database completely oriented towards managing customers. Many data marts can be created for managing customers with a specific goal like marketing. These data marts are then slowly interconnected so as to create a data warehouse.

OLAP is used to produce fast, interactive answers to user queries on data warehouses. A data cube allows such multidimensional data to be effectively modeled and viewed in N dimensions. Some typical data mining queries involve summarization of data or other tasks like classification, prediction and so forth.

5. Object-oriented databases

This is an extension of the relational model that provides facilities for complex objects using object orientation. The entities in this model are called objects. For example, a patient record in a medical database is an object.

An object includes a set of variables, which are analogous to attributes. An object can send or receive messages to and from other objects for communication, and each object has a set of methods; a method can receive a message and return a value. Similar objects can be grouped into a class, and all the objects inherit the variables that belong to the super class. Data mining should have provisions to handle complex object structures, data types, class hierarchies and object-oriented features like inheritance, methods and procedures.

6. Temporal Databases

Many data mining applications are content to deal with static databases where the data has no timing relationships. But in some cases, records associated with time stamps significantly enhance the mined knowledge.

Hence temporal databases store time-related data. The attributes may have time stamps of different semantics. Many databases already use fields like "date created" and "date modified".


Sequential databases store sequences of ordered events, with or without the notion of time.

Time-series databases store time-related information like log files. This data represents sequences of values or events obtained over a period (say hourly, weekly or yearly) or over repeated time spans. Continuous observation of the sales of a product may yield time-series data. Data mining also performs trend analysis upon these temporal, sequence and time-series databases.

7. Spatial Databases and Spatiotemporal Databases

Spatial databases contain spatial information in a raster format or a vector format. Raster formats are either bitmaps or pixel maps; for example, images can be stored as raster data. On the other hand, the vector format can be used to store maps, because maps use basic geometric primitives like points, lines, polygons and so forth.

Many geographic databases are used by data mining algorithms to predict the location of telephone cables, pipes, etc. Applications using data mining for spatial data include clustering, classification, association and outlier analysis.

Spatiotemporal databases use spatial data along with time information.

8. Text Database

Text is one of the most commonly used multimedia data types. It is also a natural choice of communication among users. Text is present in documents, emails, Internet chat and also in "gray domains" whose content is available only to selected audiences.

Text databases contain word descriptions of objects, in the form of long sentences or paragraphs. Text data can be unstructured (like documents in HTML or XML), semi-structured (like emails) or structured (like library catalogues). Much of the text data is in compressed form as well.

Retrieving text is generally considered to be part of information retrieval. But text mining is proving to be quite popular with the advent of Internet technology, with applications like text clustering, text retrieval and text classification systems. The use of text compression for improving the efficiency of text mining systems is also a great challenge.

9. Multimedia databases

Multimedia databases are specialized databases that contain high-dimensionality multimedia data like images, video, and audio. The data may be stored either in a compressed or in an uncompressed format. Multimedia databases are typically very large, and data mining of these data is quite useful in applications like content-based retrieval and visualization.


10. Heterogeneous and Legacy databases

Many business organizations are truly multinational organizations spanning many continents. Their database structure is an interconnection of many heterogeneous databases, because each database can have different types of objects and different types of message formats. Effective processing is a bit complicated because of the underlying diverse semantics.

Many enterprises have existing infrastructure where diverse data storage forms like flat files, relational databases, and other forms of data storage are connected by networks. These sorts of systems are called legacy systems.

11. Data Stream

Data streams are dynamic data which flow in and out of the observing environment. The typical characteristics of a data stream are huge volume of data, dynamic behavior, fixed order of movement and real-time constraints. Capturing and storing these dynamic data is a great challenge.

12. World Wide Web

The WWW provides a diverse, worldwide, online information source. The objective of mining these data is to find interesting patterns of information.

The WWW is a huge collection of documents that comprises semi-structured information (HTML, XML), hyperlink information, access and usage information, and dynamically changing contents of web pages. Web mining refers to the process of mining the web for knowledge extraction.

1.6 COMPONENTS OF A DATA MINING ALGORITHM

A data mining algorithm consists of the following components:

1. Model or pattern
2. Preference or scoring function
3. Search algorithm
4. Data management strategy

All data mining algorithms produce either a model or a pattern. It is better to have a clear idea of the difference between a model and a pattern.

1. A model is a high-level, global description of data. A data mining algorithm fits models to the given data, and choosing a suitable model is an important step in the data mining process. The model may be descriptive (it can summarize the data in a comprehensible manner) or predictive (it can infer new information). This allows the user to get a sense of the data and enables the user to make statements about the nature of the data or to predict future data or information.

For example, in a regression analysis, a basic predictive model can be constructed as a function of some form like

Y = a X + b

Here X is the predictor variable, Y is the response variable, and 'a' and 'b' are the parameters of the model. This equation may be used to predict the expenditure of a person given his annual income. This model may be adequate because, for a person, spending may increase with income; hence the response variable is linearly related to the predictor value.
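A minimal sketch (plain Python, with hypothetical parameter values) of using such a predictive model to estimate expenditure from annual income:

# hypothetical parameters of the fitted model Y = a*X + b
a, b = 0.6, 5000.0

def predict_expenditure(annual_income):
    # the model infers the response Y from the predictor X
    return a * annual_income + b

print(predict_expenditure(300000))   # predicted expenditure for an annual income of 300000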

In contrast to models, pattern structures make statements about restricted regions of the space spanned by the variables. In other words, a pattern describes a small part of the data.

For example, If Y > y, then Probability (X > x1) = p1.

These variables are arbitrary; the pattern simply introduces constraints on the variables connected by a probabilistic rule. Certainly not all the records obey this rule; only the records that feature these variables are affected by it. These sorts of restricted statements are characteristic of patterns.

Hence model or pattern structures are associated with parameters, and the data must be analyzed to estimate their values. This is the job of the data mining algorithm. A good model or pattern has specific, optimized values for its parameters.

Therefore, the distinction between models and patterns is very important. Sometimes it is difficult to recognize whether a structure is a model or a pattern, as the boundary separating them is thin, but a careful analysis can help to distinguish a model from a pattern.

IBM has identified two types of models, or modes of operation, which may be used to unearth information of interest to the user.

Verification Model

The verification model takes a hypothesis from the user in the form of a query and tests the validity of the hypothesis against the data. The responsibility of formulating the hypothesis and the query belongs to the user. Here no new information is created; only the outputs of the queries are analyzed to verify or negate the hypothesis. The search is refined until the user detects the hidden information.


Discovery Model

The discovery model discovers important information hidden in the data. The model sifts the data for frequently occurring patterns, trends and generalizations. The user's role here is often very limited. The aim of this model is to reveal the large amount of hidden information present in the dataset.

Some authors classify the models as

1. Descriptive models

The aim of these models is to describe the dataset in terms of groups. These are called symmetrical, unsupervised or indirect methods.

2. Predictive models

The aim of these methods is to describe one or more variables in relation to all the other attributes. These are called supervised, asymmetrical or direct methods. These methods sift the data for hidden rules that can be used for classification or clustering.

3. Logical models

These models identify particular characteristics related to subsets of interest in the database. The factor that distinguishes these methods from others is that they are local in nature. Association is an example of a logical model.

4. Preference function (Score function)

The choice of model depends on the given data. Normally some form of goodness-of-fit function is used to decide on a good model, and preference or score functions are used to specify the preference criterion. Ideally a good score function should indicate the true expected benefit of the model, but in practice such a score function is difficult to find. However, without score functions, it is difficult to judge the quality of the model.

Some of the score functions that are commonly used are likelihood, sum of squared errors, and misclassification rate.

For example, for the above model, the sum of squared errors

S = Σ (ŷ(i) - y(i))²,  summed over i = 1, …, n

can be used as the score function. Here ŷ(i) is the value predicted by the model and y(i) is the actual (target) value for 1 <= i <= n. This scoring function estimates the error based on the difference between the predicted and actual values.
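A minimal sketch (plain Python, with hypothetical predictions) of this sum-of-squared-errors score function:

def sum_of_squared_errors(predicted, actual):
    # score function: sum over i of (y_hat(i) - y(i)) squared
    return sum((p - t) ** 2 for p, t in zip(predicted, actual))

print(sum_of_squared_errors([2.0, 4.1, 6.2], [2.0, 4.0, 6.0]))   # a small value indicates a good fit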


Sometimes it may be difficult to apply a scoring function to a very large data set, so the choice of scoring function is moderated by the practicality of applying it. The scoring function must also be robust: if it is very susceptible to changes in the dataset, a small change in the dataset can dramatically change the estimate of the model. Hence a robust scoring function is a necessity for a good data mining algorithm.

5. Search Algorithm

The goal of the search algorithm is to determine the structure and parameter values that achieve the maximum or minimum (depending on the context) value of the score function. This is normally presented as an optimization or estimation problem. The search algorithm is the specification of how to find particular models or patterns, and the parameters of the model, for a given data set and score function.

Normally the problem of finding interesting patterns is posed as a combinatorial problem, and heuristic search algorithms are often utilized to find interesting patterns. For example, for the above linear regression problem, the search algorithm must minimize the least squares score function. A good data mining algorithm should have a good search algorithm to achieve its primary objectives.
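A minimal sketch (plain Python, with hypothetical data and learning rate) of one such search procedure, gradient descent, adjusting the parameters a and b of Y = aX + b so as to minimize the sum of squared errors:

# hypothetical training data, roughly following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 7.2, 8.9]

a, b = 0.0, 0.0        # initial parameter values
rate = 0.01            # step size of the search
for _ in range(5000):
    # gradients of the sum of squared errors with respect to a and b
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
    a -= rate * grad_a
    b -= rate * grad_b

print(round(a, 2), round(b, 2))   # parameter values found by the search (close to 2 and 1)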

6. Data Management Strategy

The data management strategy deals with indexing and accessing data. Many algorithms assume that the data is available in primary memory, but for data mining it is inevitable that massive datasets still reside in secondary memory.

Many data mining algorithms scale very poorly when applied to large datasets that reside in secondary storage. Hence it is necessary to develop data mining algorithms with an explicit specification of the data management strategy.

1.7 STEPS IN DATA MINING PROCESS

It should be remembered that KDD is a complete process and data mining is an algorithmic component of the KDD process. The KDD process is outlined below.

1. Understanding the application domain

Defining the objectives for data mining is the preliminary step in the KDD process. The business organization should have a clear objective for implementing the data mining process, and these objectives should be translated into a set of clear tasks that can be implemented. A clear-cut problem statement is required for implementing the data mining process; a good problem statement leaves no room for doubts or uncertainties.


2. Extracting the target dataset

First it is necessary to identify the data sources. The ideal source for the data mining process is a data warehouse. A sort of exploratory analysis can be carried out to identify the potential data mining tasks that can be performed.

3. Data Preprocessing

This step is essential for improving the quality of the actual data and is required for getting high quality information. It involves data cleaning, data transformation, data integration, and data reduction or data compression, as illustrated in the sketch after the following list.

a. Data cleaning: This process involves basic operations such as normalization, noise removal, handling missing data, and reduction of redundancy. Real-world data is often erroneous, hence this step is vital for data mining.

b. Data integration: This operation involves integrating multiple, heterogeneous datasets generated from different sources.

c. Data reduction and projection: This operation includes identification of the attributes necessary for data mining and reducing the number of attributes using dimension reduction, feature extraction and discretization methods.
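A minimal sketch (plain Python, with hypothetical attribute values) of two common preprocessing operations, missing-value handling and min-max normalization:

# hypothetical attribute values; None marks a missing measurement
values = [12.0, None, 18.0, 15.0, None, 21.0]

# data cleaning: replace missing values with the mean of the known values
known = [v for v in values if v is not None]
mean = sum(known) / len(known)
cleaned = [v if v is not None else mean for v in values]

# normalization: rescale every value into the range [0, 1]
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]
print(normalized)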

4. Application of data mining Algorithm

This step applies the data mining algorithms to the data. The algorithms perform tasks like classification, regression, clustering, summarization, and association. Basically, data mining operations infer a model from the data, and the models can be classified according to the aim of the analysis.

5. Interpretation of results

The objective is to extract the hidden, meaningful patterns that can be of interest and hence useful for the organization. Usefulness is always determined by metrics, and there are two types of metrics that can be used - objective and subjective metrics.

Objective metrics use the structure of the pattern and are quantitative in nature. Subjective metrics take into account the user's rating of the knowledge obtained. This includes

1. Unexpectedness: This is potentially new information that was previously unknown to the user. This amounts to some sort of discovery of a fact.

2. Actionability: This is a factor the user can take advantage of to fulfill the goals of the organization.


6. Using the results.

The business knowledge that is extracted should be integrated into the intelligent systems employed by the organization to exploit its full potential. This should be done gradually, and the process continues until the intelligent system achieves its perfection.

The process of integration involves four phases

1. Strategic phase: This phase involves steps like (1) identification of areas where data mining could give benefits, (2) definition of objectives for a pilot data mining project and (3) evaluation of the pilot project using suitable criteria.

2. Training phase: This phase evaluates the data mining activity carefully. If the pilot project is positive, the prototype data mining solution is evaluated and formulated and the data mining technique is confirmed as the right one.

3. Creation phase: If the results of the pilot project are satisfactory, a plan can be formulated so that the business procedures can be reorganized to include the data mining activity. A project can then be initiated to carry out this task by allocating additional personnel and time.

4. Migration phase: This phase includes training the personnel so that the organization can be prepared for data mining integration. These steps are repetitive in nature and are continued until the objectives are met.

1.8 DATA MINING FUNCTIONALITIES

Data mining software analyzes relationships and patterns in stored data. Generally it identifies four kinds of relationships present in the data. They are as follows.

Classes: Stored data is used to locate data in predetermined groups. This is called categorization. For example, sex is a category under which the students of a class can be categorized.

Clusters: Data items are grouped according to logical relationships or similarities that are present in the data.

Associations: Data can be mined to identify associations that exist within the data. For example, a person who buys item X often also buys item Y.

Sequential patterns: Patterns that are used to anticipate behavior or trends; this involves prediction of consumer purchase patterns.

Some of the model functions are listed below

1. Exploratory data analysis
2. Classification
3. Regression
4. Clustering
5. Rule generation
6. Sequence analysis

Page 23: Dmc 1628 Data Warehousing and Data Mining

NOTES

23

DATA WAREHOUSING AND DATA MINING

ANNA UNIVERSITY CHENNAI

1.8.1 Exploratory Data Analysis

The main idea is to explore the given data without any clear idea of what we are looking for. Mostly these techniques are interactive in nature and produce output in visual form. Before we do analysis, it is necessary to understand the data. Data are of several types: numerical or categorical (e.g., Tall/Short). Categorical data may further be divided into ordinal data, which have some order (high/middle/low), and nominal data, which are unordered. Producing summary reports that include averages, standard deviations and distributions of the data makes the data understandable for the user.

Graphing and visualization tools are vital for EDA. Ultimately all the analyses are performed for the users, hence producing output that can be easily understood by humans in graphical forms like charts and plots is a necessary requirement of EDA. Usually for datasets of small dimensionality, graphical output is easy, but for multiple dimensions visualization becomes a difficult task. Hence higher-dimensional data are displayed at a lower resolution while attempting to retain as much information as possible.

The kinds of exploratory analysis that are useful for data mining are

1. Data characterization involves the summarization of data; for example, an attribute can be summarized. OLAP is a good example of data characterization. The output of data characterization is generally in the form of a graph like a pie chart, bar chart, curve or OLAP cube.

2. Data discrimination involves the comparison of target variables with the other attributes of the object.

Both data characterization and discrimination can be combined to get a better view of the data before the data mining process.
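A minimal sketch (assuming Python with the pandas library; the attribute values are hypothetical) of data characterization through a summary report:

import pandas as pd

# hypothetical numeric attribute of a dataset
monthly_sales = pd.Series([120, 150, 90, 170, 130], name="monthly_sales")
print(monthly_sales.describe())   # count, mean, standard deviation, min, quartiles, max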

1.8.2 Classification

A classification model is used to classify test instances. The input for the classification model is typically a dataset of training records with different attributes. The attributes x1, x2, …, xn are the input attributes, called predictor or independent variables. What we are predicting - the class labels - are called response, dependent or target variables. The classification model predicts the class to which the input records belong.

The resulting model is then used to classify the new instances or test instances.

Some of the examples of the classification tasks can be given as

1. Classification of a dataset of disease symptoms into diseases
2. Classification of customer signatures into valid or invalid categories
3. Classification of a customer's financial and general background for loan approval

The classification models can be categorized based on the implementation technology.

Page 24: Dmc 1628 Data Warehousing and Data Mining

DMC 1628

NOTES

24 ANNA UNIVERSITY CHENNAI

1. Decision trees: The classification model can generate decision trees as output. The decision trees can be interpreted as classification rules, which can be incorporated into an intelligent system or used to augment expert systems.

2. Probabilistic methods: These include models which use statistical properties like Bayes' theorem.

3. Nearest neighbor classifiers, which use distance measures.
4. Regression methods, which can be linear or polynomial.
5. Soft computing approaches: These include neural networks, genetic algorithms and rough set theory.

Regression is a special type of classification that uses existing values to forecast newvalues. Most of the problems are not linear projections of the previous values. For exampleshare market data will have fluctuations and most of the values that need to be predictedare numerical. Hence more complicated techniques are required to forecast future values.

1.8.3 Association Analysis

A good example of an association function is market basket analysis. All transactions are taken as input, and the output shows the associations between items of the input data.

Association rule mining algorithms produce results of the form X -> Y, where X is called the antecedent and Y the consequent.

The frequency with which a particular association appears in the database is called its support or prevalence.

Confidence measures the relative frequency of occurrence of items and their combinations: given the occurrence of X, how often Y also occurs. It is the ratio of the frequency of X and Y together to the frequency of X.

Lift is also a measure of the power of an association. It is calculated as the ratio of the confidence of X -> Y to the frequency of Y. The higher the lift, the greater the influence that the occurrence of X has on the likelihood that Y will occur. A small sketch of these measures follows.
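As a rough sketch (the items and transactions below are made up for illustration), support, confidence and lift for a single rule X -> Y can be computed directly from a list of market-basket transactions.

```python
# Toy market-basket transactions (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

X, Y = {"bread"}, {"milk"}
n = len(transactions)

freq_X  = sum(1 for t in transactions if X <= t) / n
freq_Y  = sum(1 for t in transactions if Y <= t) / n
freq_XY = sum(1 for t in transactions if (X | Y) <= t) / n

support    = freq_XY               # how often X and Y occur together
confidence = freq_XY / freq_X      # given X, how often Y also occurs
lift       = confidence / freq_Y   # influence of X on the likelihood of Y

print(support, confidence, lift)
```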

Graph tools are used to visualize the structure of links. Thicker lines can be used to indicate stronger associations, and these linkage diagrams show the analysis that is termed link analysis.

The sequential/temporal pattern functions analyze a collection of records over a period of time to identify trends. These models take into account the distinct properties of time and periods (like week or calendar year). Business transactions are frequently analyzed in this way over a collection of related records of the same structure.


The records are related by the identity of the customer who made the repeated purchases. Such analysis can assist business organizations in targeting a specific group more often.

1.8.4 Clustering/Segmentation

Clustering and segmentation are the processes of creating partitions. All the data objects of a partition are similar in some aspect and differ significantly from the data objects in the other partitions. This is also called unsupervised classification, as there are no predefined classes.

Some examples of the clustering process are

1. Segmentation of a region of interest in an image
2. Detection of abnormal growth in a medical image
3. Determining clusters of signatures in a gene database

The quality of a clustering algorithm depends on the similarity measures it uses and its implementation. A good segmentation algorithm should generate clusters that have high intra-class similarity and low inter-class similarity. The clustering process should also detect the hidden patterns that are present in the dataset.

Clustering algorithms can be categorized as

1. Partitional: These algorithms create an initial partition and then optimize an objective function using an iterative control strategy.

2. Hierarchical: These algorithms produce the hierarchical relationships that exist in the dataset.

3. Density based: These algorithms use connectivity and density functions to produce clusters.

4. Grid based: These algorithms produce a multiple-level granular structure by quantizing the feature space into finite cells.
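A minimal sketch of the partitional approach, assuming scikit-learn is installed and using made-up two-dimensional points; the data and the choice of two clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans

# Made-up 2D points forming two loose groups
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]

# Partitional clustering: start from an initial partition and
# iteratively optimize the within-cluster distance objective
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # representative centre of each cluster
```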

1.8.5 Outlier Analysis

Outliers express deviation from the normal behavior or expectations. These deviations may also be due to noise, so data mining algorithms should be very careful about whether the deviations are real or due to noise. Detection of truly unusual behavior in a given application context is called outlier analysis.

Applications where outliers prove to be very useful include forecasting, fraud detection and detection of abnormal customer behavior.

1.8.6 Evolution Analysis

Evolution analysis models and describes the trends of an object over a period of time. The distinct features of this analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based analysis.


1.9 DATA MINING PROCESS

The data mining process involves four steps:

1. Assemble the data from various sources
2. Present the data to the data mining algorithm
3. Analyze the output and interpret the results
4. Apply the results to new situations

The emerging process model for data mining solutions in business organizations is CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining. This process involves six steps, listed below.

The steps are listed below

1. Understanding the business

This step involves understanding the objectives and requirements of the business organization. Generally a single data mining algorithm is enough to provide the solution. This step also involves the formulation of the problem statement for the data mining process.

2. Understanding the data

This step involves tasks like

Data collection
Study of the characteristics of the data
Formulation of hypotheses
Matching of patterns to the selected hypotheses

3. Preparation of data

This step involves producing the final data set by cleaning the raw data and preparing the data for the data mining process.

4. Modeling

This step involves the application of the data mining algorithm to the data to obtain a model or pattern.

5. Evaluate

This step involves the evaluation of the data mining results using statistical analysis and visualization methods.


6. Deployment

This step involves the deployment of the results of the data mining algorithm to improve the existing process or for a new situation.

1.10 ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM

The architecture of a typical data mining system implements the data mining process and has the necessary components to do so. The typical components of a data mining system are listed below.

1. Database: A typical system may have one or more databases. The data present in the database is cleaned and integrated.

2. Database server: The database server is responsible for fetching the data as per the user query.

3. Knowledge base: This is the domain knowledge that is used to guide the search or to evaluate the outcomes of the query based on interestingness criteria.

4. Data mining engine: This represents the heart of the system. It contains the necessary functional modules for classification, prediction, association and deviation analysis.

5. Pattern evaluation module: This module uses the necessary metrics, such as interestingness measures, to indicate the quality of the results.

6. GUI: This provides the necessary interface between the user and the data mining system. It allows the user to interact with the system, to provide the user query and to perform the necessary mining tasks. The interface also presents the results to the user, using visualization methods, for interpretation.

One of the major decisions that an organization needs to take is the kind of interaction the data mining system has with the existing system. If no interconnection exists between the data mining system and the database or data warehouse, the design is called no coupling; otherwise, based on the degree of connection, it can be called loose coupling, semitight coupling or tight coupling. Let us review these designs briefly.

No coupling

No coupling means there is no interconnection between the data mining system and the existing systems. The data mining system can fetch data from any data source for data mining tasks and produce the results in the same data storage system. No coupling is a poor design strategy. Some of the major disadvantages of this kind of arrangement are:

The data mining algorithms need to spend a lot of time extracting the data from multiple sources and cleaning, transforming and integrating it so that the results are reliable. This places a big responsibility on the data mining algorithms.


Scalable and flexible algorithms and structures of the database and data warehouse are not utilized. This leads to degradation of the performance of the algorithms.

Loose coupling

This design lets the data mining system use some facilities of the existing database or data warehouse. Most of the responsibility for supplying good quality data is undertaken by the existing database or data warehouse, letting the data mining algorithms concentrate on other aspects like scalability, flexibility and efficiency. One of the major problems associated with loose coupling is that such systems are memory based, so handling larger data sets makes it difficult for them to achieve high scalability and good performance.

Semitight coupling

For these systems, there is an interconnection between the data mining system and the existing database or data warehouse. In addition, a few data mining primitives are provided by the database or data warehouse; these primitives provide basic functionality like sorting and aggregation. Moreover, some of the frequently used intermediate results of the data mining system are stored in the database or data warehouse for reuse. This strategy enhances the performance of the system.

Tight coupling

Here the data mining system and the existing database or data warehouse are completely integrated. The database and data warehouse optimize the data mining queries using existing technology. This approach is highly desirable, as an integrated environment provides superior performance. But not all organizations possess a completely integrated environment, and a lot of effort is required to design this environment and the data mining primitives.

Task primitives and queries

A data mining task is some form of data analysis; for example, classification is a data mining task. A data mining query is expressed in terms of data mining task primitives. The user communicates with the data mining system through a query; the system executes the query and produces the result in a user-understandable form, generally as graphics.

The data mining task primitives are of the following form:

1. Set of task-relevant data

This specifies the portions of the database, or the data, in which the user is interested.

2. Kinds of knowledge

This specifies the kind of data mining functions to be performed.


3. Background knowledge

This is knowledge about the domain, and it is used to guide the data mining process effectively. Concept hierarchies are a popular way of representing background knowledge. The user can also specify beliefs in the form of constraints and thresholds.

4. Interestingness measures and thresholds

These measures are used to guide and evaluate the results of data mining. For example, association results can be evaluated using interestingness criteria like support and confidence.

5. Representation

The user can specify the form in which the results of the mining should be represented. These may include rules, tables, charts or any form in which the user wishes to see the mined patterns.

A data mining query can incorporate these primitives. Designing a complete data mining language is quite a challenging task, as every data mining task has its own requirements. Some efforts are being made in this respect; for example, Microsoft OLE DB includes DMX, a query language for data mining models, and there also exist PMML (Predictive Model Markup Language) and the CRISP-DM (Cross-Industry Standard Process for Data Mining) standard, which address data mining language and process requirements.

1.11 CLASSIFICATION OF DATA MINING SYSTEM

The data mining systems can be categorized under various criteria

1. Classification according to the kinds of databases used. The databases themselves are classified according to the data models, like relational, transactional, object-oriented, or data warehousing systems.

2. Classification according to the special kinds of data handled. Based on this, a data mining system can be categorized as a spatial, time-series, text, multimedia or Web data mining system. There can also be other types, like heterogeneous data mining systems or legacy data mining systems.

3. Classification according to the kinds of knowledge mined. Based on this, a data mining system can be classified by functionalities like classification, prediction, clustering, trend and deviation analysis. A system can provide one or more functionalities. The knowledge from such systems can also vary based on the granularity or level of abstraction, such as general knowledge, primitive-level knowledge or knowledge at multiple levels.

4. Classification based on the level of user interaction, such as autonomous systems, interactive exploratory systems or query-driven systems.


Summary

Data mining is an automated way of converting data into knowledge.

Data is about facts. Information is processed data in the form of patterns, associations or relationships among data. Knowledge is processed information as historical patterns and trends. Intelligence is the application of knowledge.

Learning is like adaptation and occurs as the result of the interaction of the program with its environment.

Data storage plays an important role in designing data mining algorithms. Typical data storage systems include flat files, databases, object-relational databases, transactional databases, data warehouses, multimedia databases and spatial databases.

A typical data mining algorithm has as components models/patterns, a score or preference function and a search mechanism.

Models are of two types: the verification model and the discovery model.

The KDD process includes steps like understanding the domain, extraction of the dataset, data preprocessing, application of the data mining model, and interpretation and application of the results.

Data mining tasks include exploratory data analysis, classification, prediction, association and deviation analysis.

A data mining system architecture should include a data repository, knowledge base, data mining engine, pattern evaluation module and a graphical interface.

Data mining systems can be classified based on various criteria, such as the databases used, the kinds of data handled, the kinds of knowledge mined and the levels of user interaction.

DID YOU KNOW?

1. What is the difference between data and information?
2. What is the difference between knowledge and intelligence?
3. What is the difference between KDD and data mining?
4. What is the difference between a model and a pattern?

Short questions

1. What does data mining mean?
2. Distinguish between the terms: data, information, knowledge and intelligence.
3. What is the relationship between KDD and data mining?
4. What is the difference between an operational and a strategic database?
5. What is the difference between OLAP and OLTP?
6. What is the difference between statistics and data mining?
7. What is the role of data storage systems for data mining?
8. What is the specialty of a data warehouse?
9. What are the types of data mining models?
10. What is the difference between a model and a pattern?


11. What are the functionalities of data mining?
12. What is the difference between classification and prediction?
13. What are the components of a data mining architecture?
14. How are data mining systems classified?

Long Questions

1. Explain in detail the KDD process.
2. Explain in detail the functionalities of the data mining model with examples.
3. Explain in detail the components of the data mining architecture.
4. Explain the ways in which a data mining system can be integrated into a modern business environment.
5. Write short notes on
   Text mining
   Multimedia mining
   Web mining


UNIT II

DATA PREPROCESSING AND ASSOCIATION RULES

INTRODUCTION

The raw data are often incomplete, invalid and inaccurate. This kind of erroneous data is also referred to as dirty data. It causes problems for many organizations, because the information or knowledge obtained from dirty data is unreliable. Researchers have found that dirty data cause a third of such projects to be delayed or even scrapped, incurring huge losses for the organization. There is a proverb, GIGO: if garbage goes in, then the output is also garbage. So bad data have to be preprocessed so that the quality of the results can be improved. This chapter presents the concepts related to data preprocessing and the methodologies for performing association analysis.

Learning Objectives

To understand the characteristics of the data
To explore the methodologies for performing data cleaning, data integration and data transformation
To provide an overview of exploratory data analysis
To explore association rule mining for performing market basket analysis

2.1 UNDERSTANDING THE DATASET

Exploratory data analysis (EDA) usually does not have any a priori notions of the expected relationships among the variables. This is true especially because data mining algorithms involve large data sets. Hence the main goals of EDA are to

Understand the data set
Examine the relationships that exist among attributes
Identify the target set
Form some initial ideas about the data set


The initial requirement of EDA is data understanding and data preparation.

Data Collection

A data set can be assumed to be a collection of data objects. The data objects may be records, points, vectors, patterns, events, cases, samples or observations. These records contain many attributes. An attribute can be defined as a property or characteristic of an object.

For example, consider the following database shown in sample Table 2.1.

Table 2.1: Sample Table

Every attribute should be associated with a value. This is called the measurement process: it associates every attribute with a value. The type of an attribute determines the kinds of values the attribute can take; this is often referred to as the measurement scale type.

Attribute Types

The attributes can be classified into two types

Categorical or qualitative data
Numerical or quantitative data

The categorical data can be divided into two types: the nominal type and the ordinal type.

In the above table, patient ID is categorical data. Categorical data are symbols and cannot be processed like numbers; for example, the average of a patient ID does not make any statistical sense.

The nominal data type provides only information and contains no order. Only operations like equal to and not equal to are meaningful for these data. For example, patient IDs can be checked for equality and nothing else.


Ordinal data provides information about order. For example, Fever = {Low, Medium, High} is ordinal data: certainly Low is less than Medium and Medium is less than High, irrespective of the underlying values. Any order-preserving transformation can be applied to these data to get new values.

Numeric or quantitative data can be divided into two categories: interval type and ratio type.

Interval data is numeric data for which the differences between values are meaningful. For example, there is a difference between 30 degrees and 40 degrees. The only permissible operations are + and -.

For ratio attributes, both differences and ratios are meaningful. The difference between ratio and interval data is the position of zero on the scale. For example, take the centigrade-Fahrenheit conversion: zero centigrade is not equal to zero Fahrenheit, so the zeroes of the two scales do not match. Hence these are interval data.

The data types can also be classified as discrete and continuous. Discrete variables have a finite set of values; the values can be either categorical or numbers. Binary attributes are special attributes that have only two values, true or false. Binary attributes where only the non-zero elements play an important role are called asymmetric binary attributes. A continuous attribute is one whose values are real numbers.

The characteristics of the larger datasets

Data mining involves very large datasets. The general characteristics of larger datasets are listed below.

1. Dimensionality

The number of attributes that the data objects possess is called the data dimension. Data objects with a small number of dimensions may not cause many problems, but higher dimensionality data pose many problems.

2. Sparseness

Sometimes only a few non-zero elements play an important role. Hence it is sufficient to store only these non-zero elements and to process them in the data mining algorithms.

Sometimes, as in the case of image and video databases, resolution also plays an important role. Coarse resolution eliminates certain information. On the other hand, if the resolution is too high, the algorithms become more computationally intensive and the data may be buried in noise.


2.2 DATA QUALITY

What are the requirements of the good quality data?

While it is understood that good quality data yields good quality results, it is often very difficult to pinpoint what constitutes good quality data. Some of the properties that are considered desirable are

1. Timeliness

There should be no decay of data; after a period of time, the business organization's data may become stale and obsolete.

2. Relevancy

The data should be relevant and ready for the data mining algorithm. All the necessary information should be available, and there should be no bias in the data.

3. Knowledge about the data

The data should be understandable, interpretable and self-sufficient for the required application, as desired by the domain knowledge engineer.

Data quality issues involve both measurement errors and data collection problems. The detection and correction of bad data is called data cleaning. Hence the data must have the necessary data quality.

Measurement errors refer to the problem of

Noise
Artifacts
Bias
Precision / Reliability
Accuracy

Data collection problems involve

Outliers
Missing values
Inconsistent values
Duplicate data

Outliers are data that exhibit characteristics that are different from other data and whose values are quite unusual.

It is often desirable to distinguish between noise and outlier data. Outliers may be legitimate data and are sometimes of interest to the data mining algorithms.


Measurement Error

Some of the attribute values differ from the true values. The numerical difference between the measured and true value is called error. This error results from an improper measuring process. On the other hand, errors can also arise from omission or duplication of attributes; this is known as data collection error.

Errors arise due to various reasons like noise and artifacts.

Noise is a random component and involves the distortion of a value or the introduction of spurious objects. The term noise is often used when the data has a spatial or temporal component. Certain deterministic distortions, for example in the form of a streak, are known as artifacts.

The data quality of numeric attributes is determined by the following factors:

Precision
Bias
Accuracy

Precision is defined as the closeness of repeated measurements. Often the standard deviation is used to measure precision.

Bias is measured as the difference between the mean of the set of values and the true value of the quantity.

Accuracy refers to the closeness of measurements to the true value of the quantity. Normally the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.

Data collection Problems

These problems are introduced at the capturing stage of the data warehouse. Sometimes the data warehouse itself may contain bad existing data, and the errors may become cumulative because of this existing bad data. Dirty data can also enter at the data capturing stage through inconsistencies that arise while combining and integrating data from different streams.

Types of bad data

The bad or dirty data can be of the following types:

Incomplete data
Inaccurate data

The real world data may contain errors or outliers. Outliers are values that are different from the normal or expected data. These may be the fault of the data capturing instrument, errors in transmission, or technology problems.


Inconsistent data: These data are due to problems in conversions, inconsistent formats and differences in units.

Invalid data (includes illegal naming conventions / inconsistent field sizes)
Contradictory data
Dummy data

Some of these errors are due to human errors like typographical mistakes, or may be due to the measurement process and structural mistakes like improper data formats.

Stages of Data management

It is often difficult to track down dirty data, evaluate them and remove them. Hence a suitable data management policy should be adopted to avoid data collection errors.

The stages of data management consist of the following steps:

Evaluation of the quality of the data
Establishing procedures to prevent dirty data
Fixing bad quality data at the operational level
Training people to manage the data
Focusing on the critical data
Establishing business rules in the organization for handling dirty data
Developing a standard for metadata

This is an iterative process, and it is carried out on a permanent basis to ensure that data is suitable for data mining. Data preprocessing routines should be applied to clean the data so that the data mining algorithms produce correct and reliable results. Data preprocessing improves the quality of the data mining techniques.

2.3 DATA PREPROCESSING

To remove the measurement and data collection errors, the raw data must be preprocessed to give accurate results. Some of the important techniques of data preprocessing are listed below.

Data Cleaning

These are routines that remove the noise and solve inconsistency problems

Data Integration

These routines merge data from multiple sources into a single data source.


Data Transformation

These routines perform operations like normalization to improve the performance of the data mining algorithm by improving the accuracy and efficiency of the results.

Data reduction

These routines reduce the data size by removing the redundant data and features.

Data cleaning routines clean up the data by filling in missing values, removing noise (by smoothing), removing outliers and solving inconsistencies associated with the data. This helps data mining to avoid overfitting of the models.

Data integration routines integrate data from multiple data sources. One of the major problems of data integration is that many data sources may hold the same data under different headings, which leads to redundant data. The main goal of data integration is to detect and remove the redundancies that arise from integration.

Data transformation routines normalize the attribute values to a range, say [-1, 1]. This is required by some data mining algorithms to produce better results. Neural networks have built-in mechanisms for normalizing the data; for other algorithms normalization should be carried out separately.

Data reduction reduces the data size but still produces the same results. There are different ways in which data reduction can be carried out. Some of them are listed below:

Data aggregation
Attribute (feature) selection
Dimensionality reduction
Numerosity reduction
Generalization
Data discretization

2.4 DATA SUMMARIZATION

Descriptive statistics is the branch of statistics concerned with summarizing and describing a dataset. Any quantity we choose to use for data mining algorithms must also be summarized and described. Descriptive statistics are just descriptive and do not go beyond that; in other words, descriptive statistics do not bother too much about generalizing.

These techniques help us to understand the nature of the data, which helps us to determine the kinds of data mining tasks that can be applied to the data. The details that help us to understand the data are measures of central tendency like the mean, median, mode and midrange.


Data dispersion measures like quartiles, interquartile range and variance help us to understand the data better.

The data mining process typically involves larger data sets, so an enormous amount of time is needed to produce and understand the results. Hence data mining algorithms should be scalable. By scalable, we mean that the data mining algorithm should be able to partition the data, perform the operations on the partitions, and combine the results of the partitions.

This makes mining algorithms work on larger data sets and also compute the results faster.

Measures of the Central tendencies

The measures of central tendency include the mean, median, mode and midrange. Measures of data dispersion include quartiles, interquartile range and variance.

The three most commonly used measures of central tendency are the following.

The mean is the average of all the values in the sample (population) and is denoted by x-bar.

Sometimes each value is associated with a weight wi, for i ranging from 1 to N; this gives a weighted mean. The problem with the mean is its extreme sensitivity to noise: even small changes in the input affect the mean drastically. Hence, for larger data sets, the extreme values (say the top 2%) are often chopped off before the mean is calculated.

The median is the value for which the given Xi are divided into two equal halves, with half of the values being lower than the median and half higher.

The procedure for obtaining the median is to sort the values of the given Xi into ascending order. If the sequence has an odd number of values, the median is the middle value; otherwise the median is the arithmetic mean of the two middle values.

The mode is the value that occurs most frequently in the dataset. The procedure for finding the mode is to calculate the frequencies of all the values in the data; the mode is the value (or values) with the highest frequency. Based on the number of modes, a dataset is classified as unimodal, bimodal or trimodal; a dataset with more than two modes is called multimodal.

Example: Find the mean, median and mode for the following data (refer Table 2.2). The patient ages in a set of records are {5, 5, 10, 10, 5, 10, 10, 20, 15}.


Table 2.2 : Sample Data

The mean of the patient age in this table is 90/9 = 10, the median is 10 as it falls in the middle of the sorted values, and the mode is 10 as it is the most frequent item in the data set. It is sometimes convenient to subdivide the data set using coordinates. Percentiles describe the data value below which a given percentage of the values fall. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile and the 75th percentile is called the third quartile. Another measure that is useful for measuring dispersion is the interquartile range.

The interquartile range (IQR) is defined as Q0.75 - Q0.25.

For example, for the patient age list sorted as {5, 5, 5, 10, 10, 10, 10, 15, 20}, the median is in the fifth position, so the median is 10. The first quartile is the median of the values below the median, that is, the median of {5, 5, 5, 10}, which is the average of the second and third values: Q0.25 = 5. Similarly, the third quartile is the median of the values above the median, {10, 10, 15, 20}, which is the average of the seventh and eighth values: Q0.75 = 12.5.

Hence the IQR = Q0.75 - Q0.25 = 12.5 - 5 = 7.5.

The semi-interquartile range is 0.5 * IQR = 0.5 * 7.5 = 3.75.
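As a small check of these figures (not part of the original notes), the following Python sketch uses the standard library statistics module; the quartiles are computed by the median-of-halves convention used above, since other conventions can give slightly different values.

```python
import statistics

ages = sorted([5, 5, 10, 10, 5, 10, 10, 20, 15])

print(statistics.mean(ages), statistics.median(ages), statistics.mode(ages))

# Quartiles by the median-of-halves convention used in the text
# (other conventions, e.g. numpy.percentile, may differ slightly)
lower, upper = ages[: len(ages) // 2], ages[(len(ages) + 1) // 2:]
q1, q3 = statistics.median(lower), statistics.median(upper)
print(q1, q3, q3 - q1)   # 5, 12.5, 7.5
```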

Unimodal curves that are slightly skewed obey the empirical relation

Mean - Mode = 3 * (Mean - Median)

The interpretation of the formula is that it holds for unimodal frequency curves that are moderately skewed. The midrange is also used to assess the central tendency of a dataset. In a normal distribution, the mean, median and mode are the same. In symmetrical distributions, it is possible for the mean and median to be the same even though there may be several modes. By contrast, in asymmetrical distributions the mean and median are not the same.


These distributions are said to be skewed: more than half the cases are either above or below the mean. Skewed data often cause problems for data mining algorithms.

Standard deviation and Variance

By far the most commonly used measures of dispersion are the variance and the standard deviation.

The mean does not convey much more than a middle point. For example, the data sets {10, 20, 30} and {10, 50, 0} both have a mean of 20; the difference between these two sets is the spread of the data.

Standard deviation is the average distance from the mean of the data set to each point. The formula for the standard deviation is

sigma = sqrt( (1/N) * sum over i = 1..N of (Xi - Xbar)^2 )

Sometimes the sum of squared deviations is divided by N - 1 instead of N. The reason is that, for larger real-world samples, division by N - 1 gives an answer closer to the actual value.

For example, for the first set above, N = 3 and the mean is 20, so the squared deviations are 100, 0 and 100. For the second data set, the squared deviations are 100, 900 and 400.


The set with larger deviation is dataset 2 because the data is more spread out.

Variance

Variance is another measure of the spread of the data: it is the square of the standard deviation. While the standard deviation is the more common measure, the variance also indicates the spread of the data effectively.
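A minimal sketch (not from the original notes) that reproduces the comparison of the two example sets using the Python statistics module; stdev and variance here are the sample versions, which divide by N - 1.

```python
import statistics

set1 = [10, 20, 30]
set2 = [10, 50, 0]

# Sample standard deviation and sample variance (divide by N - 1)
print(statistics.stdev(set1), statistics.variance(set1))  # 10.0, 100
print(statistics.stdev(set2), statistics.variance(set2))  # ~26.46, 700
```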

Co-variance

The two measures above, standard deviation and variance, are one dimensional. But data mining algorithms involve data of higher dimensions, so it becomes necessary to analyze the relationship between two dimensions.

Co-variance is used to measure the variation between two dimensions. The formula for finding the covariance is

Cov(X, Y) = sum over i = 1..n of (Xi - Xbar)(Yi - Ybar) / (N - 1)

The covariance indicates the relationship between the dimensions through its sign; the sign is more important than the actual value.

1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, it indicates that the two dimensions are independent of each other.

If the dimensions are correlated, then it is better to remove one dimension, as it is a redundant dimension. Also, cov(X, Y) is the same as cov(Y, X).

Covariance matrix

Data mining algorithms handle data of multiple dimensions. For example, if the data is three dimensional, then the covariance calculations involve cov(x, y), cov(x, z) and cov(y, z).


In fact, for n dimensions, n! / ((n - 2)! * 2!) calculations are required to obtain all the different covariance values.

It is better to arrange the values in the form of a matrix

We can also note that the matrix is symmetrical about the main diagonal.
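As a rough illustration (not part of the original notes), numpy can compute the full covariance matrix of a small made-up three-dimensional data set; numpy.cov divides by N - 1 by default and returns a symmetric matrix.

```python
import numpy as np

# Toy data: three dimensions, each observed over five records
x = [2.1, 2.5, 3.6, 4.0, 4.6]
y = [8.0, 10.0, 12.0, 14.0, 16.0]
z = [5.0, 4.0, 3.5, 3.0, 2.0]

# Covariance matrix: rows of the input are treated as variables
cov = np.cov(np.array([x, y, z]))
print(cov)        # symmetric about the main diagonal
print(cov[0, 1])  # cov(x, y): positive -> x and y increase together
print(cov[0, 2])  # cov(x, z): negative -> z decreases as x increases
```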

Shapes of Distributions

Skew

A given dataset may have an even distribution of data, or it may contain some extremely high or extremely low values. If the dataset has far more high values, then the dataset is said to be skewed to the right. On the other hand, if the dataset has far more low values, then the dataset is said to be skewed to the left.

The implication of this is that if the data is skewed, then there is a greater chance of outliers in the dataset. This affects the mean and the median, and hence may affect the performance of the data mining algorithm.

Generally, for a highly right-skewed distribution, the mean is greater than the median. The relationship between skew and the relative size of the mean and median can be summarized by a convenient numerical skew index:

skew index = 3 * (Mean - Median) / sigma

The following measure is also commonly used to measure skewness:

skewness = sum of (X - Xbar)^3 / (N * sigma^3)

Kurtosis

Kurtosis measures how fat or thin the tails of a distribution are. It is often computed as

kurtosis = sum of (X - Xbar)^4 / (N * sigma^4) - 3

Distributions with long tails are called leptokurtic, and distributions with short tails are called platykurtic. With this definition, normal distributions have zero kurtosis.

The most common measures of data dispersion are

Range
Five-number summary



Interquartile range
Standard deviation

Range is the difference between the maximum and the minimum of the values.

The kth percentile is the value Xi with the property that k% of the data lie at or below Xi. The median is the 50th percentile. Q1 is the first quartile, that is, the 25th percentile; Q3 is the third quartile, represented by the 75th percentile. The IQR is the difference between Q3 and Q1:

Interquartile range = Q3 - Q1

Outliers are normally the values falling at least 1.5 * IQR above the third quartile or below the first quartile.

The median, the quartiles Q1 and Q3, and the minimum and maximum, written in the order <Minimum, Q1, Median, Q3, Maximum>, are known as the five-number summary.

Box plots are a popular way of plotting the five-number summary.

2.5 VISUALIZATION METHODS

Visualization is an important aspect of data mining. Visualizing the data summaries as graphs helps the user to recognize and interpret the results quickly. Showing the important information graphically reduces the time required for interpretation.

Stem and Leaf Plot

A stem and leaf plot is a display that helps us to know the shape and distribution of the data. In this method, each value is split into a "stem" and a "leaf". The last digit is usually the leaf, and the digits to the left of the leaf form the stem. The stem and leaf plot for the patient age data of the above example is shown below (refer Figure 2.1).

Stems/leaves

2|0

1|5

1|0000

0|555

Figure 2.1 Stem and Leaf Plot.


Box Plot

A box plot is also known as a box and whisker plot. It summarizes the following statistical measures:

Median
Upper and lower quartiles
Maximum and minimum data values

The box contains the bulk of the data; these data are between the first and third quartiles. The line inside the box indicates the location, usually the median, of the data. If the median line is not equidistant from the ends of the box, the data is skewed. The whiskers project from the ends of the box, indicate the spread of the tails, and mark the maximum and minimum of the data values. The plot of a lung cancer data set is given to illustrate the concepts; this data set is used for other types of graphs also. It is available as an open data set and was published along with the paper (Hong, Z.Q. and Yang, J.Y., "Optimal Discriminant Plane for a Small Number of Samples and Design Method of Classifier on the Plane", Pattern Recognition, Vol. 24, No. 4, pp. 317-324, 1991). The data set has 1 class attribute and 56 predictive attributes.

A sample box plot of lung cancer data is shown in Figure 2.2.

Figure 2.2 Box plot of the attributes of Lung cancer data

The data outside the whiskers are called outliers. Sometimes a diamond inside the box also shows a confidence interval. The advantages of box plots are the data summary they provide and their ability to show skewness in the data; they also show the outliers, and box and whisker plots can be used to compare data sets. The negative side of this plot is its overemphasis on the tails of the distribution.
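A minimal plotting sketch, assuming matplotlib and numpy are installed; random data stand in for the lung cancer attributes, which are not reproduced in these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in data: three attributes with different spreads
attributes = [rng.normal(0, 1, 100),
              rng.normal(1, 2, 100),
              rng.normal(-1, 0.5, 100)]

# Box-and-whisker plot: box = quartiles, line = median,
# whiskers = spread of the tails, points beyond = outliers
plt.boxplot(attributes)
plt.xticks([1, 2, 3], ["V1", "V2", "V3"])
plt.show()
```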


Histogram

The histogram is a plot of classes against their frequencies. A group of data is called a class, and its frequency count is the number of times the data appear in the dataset (refer Figure 2.3). A histogram conveys useful information about the general shape and symmetry of the distribution. It can also convey information about the nature of the data (like unimodal, bimodal or multimodal).

Perhaps the biggest advantage of the histogram is its ability to show the skewness present in the data. It can also provide clues to the process and measurement problems associated with the data set.

The shape of the histogram also depends on the number of bins. If the bins are too wide, then important details may be omitted. On the other hand, if the bins are too narrow, then spurious detail may appear to be genuine information. Hence the user must try different numbers of bins to get an appropriate bin width; in general, they range from 5 to 20 groups of data.

The histogram and the box plot play an important role in the data mining process by showing the shape of the distribution (as indicated by the histogram) and other statistical properties (as indicated by the box plot).

Figure 2.3 Sample histogram of variable V1 vs. the count for the Lung cancer data set.


Scatter plot

A scatter plot is a plot of an explanatory variable against a response variable. It is a 2D graph showing the relationship between two variables. The scatter plot (refer Figure 2.4) indicates

Strength
Shape
Direction
Presence of outliers

It is useful in exploratory data analysis before actually calculating a correlation coefficient or fitting a regression curve.

Figure 2.4: Scatter plot of variable V2 Vs V1 of Lung cancer data

Figure 2.5 Sample Matrix Plot

Normal Quantile Plots

A normal quantile plot is a 2D scatter plot of the percentiles of the data versus the percentiles of the normal population. It shows whether the data come from a normal distribution or not. This is considered important because many data mining algorithms assume that data follow normal distributions, and the plot can be used to verify this assumption.


Quantile-Quantile plot

A Q-Q plot is a 2D scatter plot of the quantiles of a first dataset against the quantiles of a second dataset. The datasets need not be of the same size; this is the great advantage of the Q-Q plot.

These plots check whether the datasets can be fitted with the same distribution. If two datasets really come from the same distribution, then the points fall along the 45-degree reference line. The larger the deviation, the greater the evidence that the datasets follow different distributions. (Refer Figure 2.6 and Figure 2.7, which show Q-Q plots where the deviations are larger.)

Figure 2.6 Normal Q-Q Plot.


Figure 2.7 Normal Q-Q Plot with more Deviations.

In short, this is a useful tool to compare and check the distributions of datasets, that is, whether

They follow the same distribution
They have common location and scale
They have similar distributional shape
They have similar tail behavior
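As a rough sketch (not from the original notes), scipy's probplot draws a normal Q-Q plot of a sample against the normal distribution; the sample here is synthetic and deliberately non-normal, so the points should bend away from the reference line.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)   # deliberately non-normal

# Normal Q-Q plot: points far from the 45-degree reference line
# suggest the sample does not follow the normal distribution
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```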

2.6 DATA CLEANING

Data cleaning routines attempt to fill in the missing values, smooth the noise while identifying the outliers, and correct the inconsistencies of the data. The procedures can take the following steps to solve the problem of missing data.

1. Ignore the tuple

A tuple with missing data, especially a missing class label, is ignored. This method is not effective when the percentage of missing values increases.


2. Filling in the values manually

Here the domain expert can analyze the data tables and fill in the values manually. But this is a time-consuming procedure and may not be feasible for larger data sets.

3. A global constant can be used to fill in the missing attributes. The missing values may be labelled "Unknown" or "Infinity", but some data mining algorithms may then give spurious results by analyzing these labels.

4. The missing value may be filled in with the attribute mean; say, the average income can replace a missing income value.

5. Use the attribute mean of all samples belonging to the same class. Here the class average replaces the missing values of all tuples that fall in this group.

6. Use the most probable value to fill in the missing value. The most probable value can be obtained from other methods like classification or decision tree prediction.

Note that some of these methods introduce bias in the data: the filled value may not be the correct value, only an estimate. The difference between the estimated and the original value is called error or bias.

Noisy data

Noise is a random error or variance in a measured value. Noise may be removed by the following methods:

Binning
Regression
Clustering

Binning is a method where the values are sorted and distributed into bins. The bins are also called buckets. The binning method then uses the neighboring values to smooth the noisy data.

Some of the techniques that are commonly used are

Smoothing by bin means: the bin mean replaces the values in the bin.
Smoothing by bin medians: each bin value is replaced by the bin median.
Smoothing by bin boundaries: each bin value is replaced by the closest bin boundary. The maximum and minimum values of the bin are called the bin boundaries.

Binning methods may also be used as a discretization technique.


Example

In the equal-frequency bin method, the data are distributed evenly across the bins. Let us assume bins of size 3.

Consider the following set:

S = {12, 14, 19, 22, 24, 26, 28, 31, 34}

Bin 1 : 12, 14, 19

Bin 2 : 22, 24, 26

Bin 3 : 28, 31, 34

By the smoothing by bin means method, the values are replaced by the bin means. This results in

Bin 1 : 15, 15, 15

Bin 2 : 24, 24, 24

Bin 3 : 31, 31, 31

Using the smoothing by bin boundaries method, the bin values become

Bin 1 : 12, 12, 19

Bin 2 : 22, 22, 26

Bin 3 : 28, 28, 34

In this method, the minimum and maximum values of each bin are determined, and each value is then replaced by the nearer of the two boundaries (a value exactly in the middle, like 24 or 31 here, is mapped to the lower boundary). A small sketch of these smoothing steps is given below.
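A minimal Python sketch of the equal-frequency binning and the two smoothing schemes above; it follows the tie-to-lower-boundary convention used in the worked example.

```python
S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]   # equal-frequency bins of size 3

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[15.0]*3, [24.0]*3, [31.0]*3]
print(by_bounds)   # [[12, 12, 19], [22, 22, 26], [28, 28, 34]]
```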

Regression

Data can be smoothed by fitting the data to a function. Linear regression finds the "best line" to fit two attributes. Multiple regression uses more than two variables to fit the data to a multidimensional surface.

Clusters

Cluster analysis is used to detect outliers. In the clustering process, all similar values form a cluster; technically, any data that fall apart from normal data behavior are outliers.


Data cleaning as a Process

The data cleaning process involves two steps. The first step is the phase of discrepancy detection, and in the second phase the errors are removed.

The discrepancies in the data are due to various reasons, such as poorly designed forms, human errors and typographical errors. Sometimes data may not be present because they are not available or because the user does not wish to enter them for personal reasons. Errors are also due to deliberate errors, data decay, inconsistent data representation and use, errors in the measuring instruments, and system errors. These can be detected with knowledge about the specific attributes, called metadata. Some of the errors are found by superimposing structure on the data; for example, the format of a date may be mm/dd/yyyy. All these data are then analyzed against the business rules.

The business rules may be designed by the business organization for data cleaning. They may include

Unique rule: each value of the given attribute must be different from all other values of that attribute.
Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.
Null rule: this rule specifies how blanks and special characters should be handled.

The second phase involves removing the errors. Tools can be designed to remove the errors: data scrubbing tools use domain knowledge to remove errors, and data auditing tools are used to find discrepancies in the rules and to find data that violate the rules.

2.7 DATA INTEGRATION

Data mining requires merging data from multiple sources. Much of the data may be present under different headers in multiple data sources; for example, data present in one table under the attribute "patient name" may be the same as data present in another table under the title "customer name". Merging these sources then creates redundancy and duplication. This problem is called the entity identification problem.

Identifying and removing the redundancy is a challenging problem. Such redundancies can be found by correlation analysis, which takes two attributes and computes a correlation coefficient measuring the correlation between them. This is applicable to numerical attributes (the Pearson coefficient) and to categorical data (the chi-square test).


Pearson Correlation coefficient

The Pearson correlation coefficient is the most common test for determining whether there is an association between two phenomena. It measures the strength and direction of a linear relationship between the X and Y variables. If the given attributes are X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), then the sample correlation coefficient, denoted by r, is

r = s_xy / sqrt(s_xx * s_yy)

where

s_xy = sum over i of (xi - xbar)(yi - ybar), s_xx = sum over i of (xi - xbar)^2, s_yy = sum over i of (yi - ybar)^2

1. If r > 0, then X and Y are positively correlated; the higher the value, the higher the correlation. By positive correlation we mean that if X increases then Y also increases.

2. If r = 0, then attributes X and Y are independent and there is no linear correlation between X and Y.

3. If r < 0, then X and Y are negatively correlated. By negative correlation we mean that if X increases then Y decreases.

Strength of Correlation

Generally, the strength of the correlation can be assessed as follows: the correlation is strong if |r| > 0.8, moderate if 0.5 < |r| < 0.8, and weak if |r| < 0.5. The usefulness of this measure is that if there is a strong correlation between two attributes, then one of the attributes can be removed; the removal of the attribute results in data reduction.

Also, the value of r does not depend upon the units of measurement; units like centimeter and meter are irrelevant. Nor does the value of r depend upon which variable is labeled X and which is labeled Y.
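A minimal sketch (the paired values are made up) showing how r can be computed with numpy, and that it is unchanged by rescaling the units or by swapping X and Y.

```python
import numpy as np

# Made-up paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r = np.corrcoef(x, y)[0, 1]
print(r)   # close to +1: strong positive correlation

# r does not change with units or with swapping X and Y
print(np.corrcoef([v * 100 for v in x], y)[0, 1])
print(np.corrcoef(y, x)[0, 1])
```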

Correlation and Causation

Correlation does not mean that one factor causes the other. For example, a correlation between economic background and marks scored does not imply that economic background caused high marks. In short, correlation is different from causation.



Chi-Square test

The chi-square test is a non-parametric test. It is used to compare the observed frequency of some observation (such as the frequency of buying different brands of TV) with the expected frequency of the observation. This comparison is used to calculate the value of the chi-square statistic, which is then compared with the chi-square distribution to make an inference about the problem.

X^2 = sum of (O - E)^2 / E

Here E is the expected frequency, O is the observed frequency, and the degrees of freedom are C - 1, where C is the number of categories.

The uses of the Chi-square tests are

To check whether the distributions of measures are the same
To check goodness-of-fit

Consider the following problem. Let us assume that in a college, both boys and girls register for a data mining course. The survey details of 210 students are tabulated in Table 2.3. The condition to be tested is whether any relationship exists between the gender attribute and registration for the data mining course.

Table 2.3 Contingency Table

The expected values can be calculated as per the formula

E11 = Count (Male) X Count (Data Mining) / N

= 120 X 80 / 210 = 45.71

E12 = Count (Female) X Count (Data Mining) / N

= 90 X 60 / 210 = 25.71

E21 = Count (Male) X Count (Not Data Mining) / N

= 120 X 40 / 210 = 22.86



E22 = Count (Female) X Count (Not Data Mining) / N

= 90 X 30 / 210 = 12.86

Then X^2 = (80 - 45.71)^2/45.71 + (60 - 25.71)^2/25.71 + (40 - 22.86)^2/22.86 + (30 - 12.86)^2/12.86

= 107.14, with degrees of freedom = (Rows - 1)(Columns - 1) = 1

Referring to the statistical tables, at the 0.001 significance level with 1 degree of freedom, the critical value is 10.828.

Since the computed value 107.14 is greater than 10.828, the hypothesis of independence is rejected: there is a relationship between gender and registering for the data mining course.

The reasoning process can be set out as follows:

State the hypotheses: H0: O = E (the attributes are independent); H1: O is not equal to E.

Set the significance level: alpha = 0.001, for which the critical value at 1 degree of freedom is 10.828.

Calculate the value of the appropriate statistic: X^2 = 107.14 with 1 degree of freedom.

Write the decision rule for rejecting the null hypothesis: reject H0 if X^2 >= 10.828 (equivalently, reject H0 for p < 0.001).

Since the computed value is greater than 10.828, we can reject the null hypothesis and accept the alternative hypothesis.

In this way, chi-square tests allow us to detect duplication of data and help us to remove redundant values.
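For reference, the same test can be run with scipy; the 2x2 contingency table below is made up for illustration and is not the table from the worked example above.

```python
from scipy.stats import chi2_contingency

# Made-up 2x2 contingency table (rows: gender, columns: course choice)
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)   # expected frequencies under independence

# A very small p-value means we reject the hypothesis of independence:
# the two attributes are related
```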

2.8 DATA TRANSFORMATION

Many data mining algorithms assume that variables are normally distributed, but real world data are not usually normally distributed. There may be valid reasons for the data to exhibit non-normality; it may also be due to things such as mistakes in data entry and missing data values. Significant violation of the assumption of normality can seriously affect data mining algorithms, so non-normal data must be conditioned so that the data mining algorithms function properly.


There are methods to deal with non-normal data. One such method is data transformation, which is the application of a mathematical function to modify the data. These modifications to the values of a variable include operations like adding constants, multiplying by constants, squaring or raising to a power. Some data transformations convert the data to logarithmic scales, invert or reflect the data, and sometimes apply trigonometric transformations to make it suitable for data mining.

Some of the data transformations are given below

Smoothing: These algorithms remove noise. This is achieved by methods like binning,regression and clustering.

Aggregation: This involves applying the aggregation operators to summarize the data.

Generalization: This technique involves replacing the low level primitive by a higherlevel concept.

Normalization: The attribute values are scaled to fit in a range (say 0-1) to improvethe performance of the data mining algorithm.

Attribute construction: In this method, new attributes are constructed from the givenattributes to help the mining process. In attribute construction, new attributes are constructedfrom given attributes which may give better performance. Say the customer height andweight may give a new attribute called height-weight ratio which may provide further detailsand stability as its variance is less compared to the original attributes.

Some of the procedures used are

1. Min-Max normalization
2. Z-Score normalization

Min-Max Procedure

Min-Max procedure is a normalization technique where each variable X is normalized by its difference with the minimum value divided by the range.

Consider the set S = {5,5,6,6,6,6,88,90}

X* = (X – min(X)) / Range(X)

   = (X – min(X)) / (max(X) – min(X))

For example, for the above set the value 88 is normalized as (88 − 5)/85 = 0.976. Thus the Min-Max normalization range is between 0 and 1.


Z-Score normalization

This procedure works by taking the difference between the field value and the mean value and by scaling this difference by the standard deviation of the attribute.

X* = (X – mean(X)) / standard deviation(X)

For example, for the patient age value 88, the Z-score is

X* = (88 – mean(X)) / standard deviation(X)

The range of the Z-score is between -4 and +4, with the mean value having a Z-score of zero.

Z-scores are also used for outlier detection. If the z-score of a data value is either less than -3 or greater than +3, then it is possibly an outlier. The major disadvantage of the z-score function is that it is very sensitive to outliers, as it depends on the mean.

Decimal Scaling

This is a procedure where the decimal point is moved while still preserving most of the digit information. Here each data point is converted to the range -1 to +1.

X* = X(i) / 10^k

For example, the patient age 88 is normalized to 0.88 by choosing k = 2. Some data mining techniques, especially neural networks, carry out data transformations inherently; for other data mining algorithms, the data transformations should be carried out separately.
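As an illustration (not from the original text), the three procedures can be sketched in Python as follows, using the sample set S from above; the function names are chosen here only for convenience.

import statistics

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)        # population standard deviation
    return [(v - mean) / sd for v in values]

def decimal_scaling(values, k):
    # k is chosen so that every scaled value falls in the range -1 to +1
    return [v / 10 ** k for v in values]

s = [5, 5, 6, 6, 6, 6, 88, 90]
print(min_max(s))               # 88 maps to (88 - 5) / 85, about 0.976
print(z_score(s))
print(decimal_scaling(s, 2))    # 88 maps to 0.88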

2.9 DATA REDUCTION

Sampling is a statistical procedure for data reduction. A statistical study starts by identifying the group about which the study is to be carried out; this group is called the population. For example, a company may want to know whether its products are popular. It is difficult to survey the entire population living in an area, so the strategy is to select a group of samples and choose representative data from it (Refer Figure 2.8).


Figure 2.8: Sampling Process

Samples are representatives of the population. If the samples are not representative, this leads to bias and can misrepresent the facts through wrong inference.

How are samples selected?

It is better to avoid a volunteer sample, because volunteer samples often lead to bias. A convenience sample is one that researchers choose because the samples are available at the right time; say, the company selects people who live near the company. In this case the sample may still be representative.

What should be the ideal size of the sample?

The greater emphasis is on the representative nature of the samples. If there are too few samples, the data will be biased. But if there are too many samples, it defeats the very purpose of sampling. Hence the sample should be of optimal size, and the size often affects the quality of the results.

Data reduction techniques reduce the dataset while maintaining the integrity of the original dataset, so the results obtained are essentially the same as for the original.

Data cube reduction

Data cubes store multidimensional, aggregated information. The base cuboid corresponds to an individual entity of interest. A cuboid at the highest level of abstraction is called the apex cuboid. Data cubes created at different levels of abstraction are often referred to as cuboids. Every level of aggregation reduces the dataset size. This helps in reducing the data size for mining problems.

Attribute subset selection reduces the dataset size by removing irrelevant features and constructing a minimum set of attributes for data mining. For n attributes, there are 2^n possible subsets, so choosing optimal attributes becomes a search problem. Typically the feature subset selection problem uses a greedy approach, making the locally optimal choice at each step while hoping that it leads to a globally optimal solution.


Statistical tests are conducted at various levels to determine statistical significance for attribute selection. Some basic techniques are given below.

Stepwise forward selection

This procedure starts with an empty set of attributes. At each step, the best remaining attribute (as determined by a statistical significance test) is added to the reduced set. This process is continued until a good reduced set of attributes is obtained.

Stepwise backward elimination

This procedure starts with the complete set of attributes. At every stage, the procedure removes the worst attribute from the set, leading to the reduced set.

Combined approach: Both forward selection and backward elimination can be combined so that at each step the procedure adds the best attribute and removes the worst attribute. A sketch of greedy forward selection is given below.
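The following is a minimal Python sketch of the greedy forward-selection idea, given for illustration only; the score() function is a hypothetical, caller-supplied quality measure (for example information gain or cross-validated accuracy) that must accept any attribute subset, including the empty one.

def forward_selection(attributes, score, max_size):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < max_size:
        # pick the attribute whose addition gives the best score
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                       # no attribute improves the subset any further
        selected.append(best)
        remaining.remove(best)
    return selected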

Decision Tree induction

The feature reduction algorithm may use decision tree algorithms for determining a good attribute set. The decision tree is constructed such that it uses information gain to partition the data into individual classes. The attributes that do not appear in the tree are assumed to be irrelevant and are removed from the reduced set.

Data Reduction

The data can be reduced using some standard techniques like principal component analysis, factor analysis and multidimensional scaling.

The goal of PCA is to reduce the set of attributes to a new, smaller set that captures the variability of the data. The variability is captured by a few components which give nearly the same result as the original set of attributes.

The advantages of PCA are

1. It identifies the strongest patterns in the data.
2. It reduces the complexity by reducing the number of attributes.
3. PCA can also remove the noise present in the data set.

The principal components produced by PCA satisfy the following properties

1. Each pair of distinct components has zero covariance.
2. PCA orders the components based on the variance of the data.
3. The first component captures the most variance of the data, the second component much less, and so on, subject to the orthogonality requirement.


The PCA algorithm is stated below

1. Get the target data set.
2. Subtract the mean. The mean of each dimension is subtracted from the data; that is, the mean of X is subtracted from the X values and the mean of Y is subtracted from the Y values. This produces a data set with zero mean.

3. The covariance matrix is calculated.
4. The eigenvalues and eigenvectors of the covariance matrix are calculated.
5. The eigenvector with the highest eigenvalue is the principal component of the data set. The eigenvalues are ordered in descending order, and the feature vector is formed with the corresponding eigenvectors as columns:
Feature Vector = {Eigenvector 1, Eigenvector 2, …, Eigenvector n}

6. The components that represent the variability of the data are preserved.

Final Data = Row Feature Vector × Row Data Adjust

Row Feature Vector is the matrix with the eigenvectors in its columns, transposed, and Row Data Adjust is the mean-adjusted data, transposed.

At any time, if the original data is desired, it can be obtained using the formula

Original Data = (Row Feature Vector^T × Final Data) + Mean

The new data is a dimensionally reduced matrix that represents the original data. Therefore PCA is effective in removing the attributes that do not contribute much. Also, if the original data is desired, it can be recovered, so no information is lost (when all components are retained).
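For illustration only, the steps above can be sketched with NumPy's eigen decomposition as follows; the variable data is assumed here to be a numeric matrix with one row per tuple and one column per attribute.

import numpy as np

def pca(data, n_components):
    mean = data.mean(axis=0)
    centered = data - mean                      # step 2: subtract the mean
    cov = np.cov(centered, rowvar=False)        # step 3: covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)    # step 4: eigenvalues / eigenvectors
    order = np.argsort(eig_vals)[::-1]          # order by descending eigenvalue
    feature_vector = eig_vecs[:, order[:n_components]]   # step 5: top components
    final_data = centered @ feature_vector      # step 6: project the data
    return final_data, feature_vector, mean

# The original data can be approximately recovered as
# (final_data @ feature_vector.T) + mean, exactly so when all components are kept.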

A scree plot is a visualization technique used to inspect the contribution of the principal components.

The Scree plot of the Lung cancer data is given in Figure 2.9.

Figure 2.9: Scree Plot of Lung Cancer Data


It can be seen that the majority of the variance is captured by the first three components. For example, the contribution of the sixth component is small compared to the first three (Refer Figure 2.10).

Figure 2.10: Few Components of the Attribute.

The scree plot illustrates how the variance of the data is distributed across the first components.

Factor analysis

Unlike PCA, factor analysis expresses the original attributes as linear combinations of a smaller number of latent or hidden attributes. Factor analysis assumes that the observed correlation is due to some underlying patterns. The variability of each attribute is split into two components: one component is due to the common factors underlying all the attributes, and the other is a specific noise component.

The variance explained by the common factors is called the communality. Unique variability is excluded from the analysis, and the common variability is what factor analysis studies.

Let f1, f2, …, fp be the latent factors. Let the original data matrix D be M × N and the new factor-score matrix F be M × P.

The standard factor analysis assumes the relation

Di*^T = A Fi*^T + ε

where D is the original matrix and Di* is its ith row, Fi* is the corresponding row of the new data matrix F, A is an N × P matrix of factor loadings which expresses the relationship between the original attributes and the latent factors, and ε is the error that is not accounted for by the common factors.


Multidimensional Scaling

The goal of Multidimensional Scaling (MDS) is to reduce the dimensions, like PCA and FA. It is done by a projection technique where the data is mapped to a lower dimensional space that preserves the distances between the objects.

The MDS approach accepts a dissimilarity matrix as input and projects it to a p-dimensional space. The input matrix entry dij is the distance between the ith object and the jth object. MDS projects the data to a new representation such that the stress is minimized.

Stress = sqrt( Σ i,j (d'ij − dij)² / Σ i,j dij² )

where dij is the original distance between objects i and j and d'ij is the distance in the projected space.

The Euclidean distance based MDS is similar to PCA.

Numerosity reduction

Parametric models

These models are used to estimate parameters of the data; the parameters are then stored instead of the actual data. The log-linear model is a good example of this type.

Non-parametric models store reduced representations of the data, which include histograms, clustering and sampling.

Regression and log-linear models

These models can approximate data. In linear regression, the data can be modeled to fit a straight line

Y = AX + B where Y is a response variable and X is a predictor variable.

The coefficients A and B are the slope and y-intercept respectively; these are called regression coefficients. The values of the coefficients can be obtained by the method of least squares, by minimizing the error between the actual and predicted data. Multiple regression involves two or more predictor variables. Regression methods are computationally inexpensive and can handle skewed data well.
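As an illustration (not part of the text), the least-squares coefficients for Y = AX + B can be computed directly from their closed-form expressions; the sample x and y values below are hypothetical.

def least_squares(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
    b = mean_y - a * mean_x     # intercept
    return a, b

a, b = least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(a, b)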

Log-linear models approximate a higher dimensional probability distribution from lower dimensional ones. A tuple can be considered as a point in N-dimensional space. Log-linear models estimate the probability of each point based on a smaller subset of dimensions. Hence these can be used for dimensionality reduction.

Histograms

Histograms use bins to represent data distributions. A histogram partitions the attribute A into disjoint buckets. If a bucket holds a single value, it is called a singleton bucket, but often buckets hold continuous ranges of an attribute.


The buckets and the attribute values are partitioned based on the following criteria.

Equal width: The width of the bucket is uniform

Equal frequency: Each bucket holds roughly the same number of values.

V-optimal: The histogram with the least variance is constructed. The histogram variance is a weighted sum of the variances of the bucket values, where the bucket weight is equal to the number of values in the bucket.

MaxDiff: The difference between each pair of adjacent values is considered, and bucket boundaries are placed between the pairs with the largest differences.

Multidimensional histograms can capture dependencies between attributes.

Clustering

Clustering partitions the objects into groups or clusters. All objects within a cluster are similar, and objects in one cluster are dissimilar to the instances of the other clusters. Cluster quality is determined by measures such as diameter and centroid distance. In data reduction, the cluster representations replace the actual data. The nature of the data determines the effectiveness of this method.

Sampling

Sampling can be used as a technique for reducing the data. It takes random samples from the total data. The sample chosen should be representative in nature.

A simple random sample without replacement

This is done by drawing s of the N tuples from D (s < N), where every tuple is equally likely to be chosen.

A Simple Random sample with replacement

This is the same as the previous technique, except that each drawn tuple is recorded and placed back, so the same tuple may be drawn again.

Cluster sample

Here the tuples are grouped into disjoint clusters, and then a random sample of the clusters is drawn.

Stratified sample

Here the population D is divided into disjoint strata; say, the customer database is divided into different strata based on age. Then a stratified sample is drawn from each stratum. This helps to ensure that a representative sample is chosen, and the technique is ideally suited for skewed data.
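For illustration, the sampling schemes described above can be sketched with Python's random module; the data and strata used here are hypothetical.

import random

data = list(range(100))                       # stand-in for the tuples in D

srs_wor = random.sample(data, 10)             # simple random sample WITHOUT replacement
srs_wr = [random.choice(data) for _ in range(10)]   # WITH replacement: a tuple may repeat

# Stratified sample: divide D into strata (here, two hypothetical age groups)
strata = {"young": data[:60], "old": data[60:]}
stratified = [random.sample(group, 5) for group in strata.values()]

print(srs_wor, srs_wr, stratified)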


The advantages of the sampling are

1. Reduced amount of processing
2. Reduced complexity

DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION TECHNIQUES

The process of converting a continuous attribute to a discrete attribute is called discretization. This leads to a concise and easy to use knowledge representation. Discretization can be categorized based on

1. Direction: top-down vs. bottom-up
2. Class information: supervised vs. unsupervised

Top-down (splitting) methods start by finding one or a few points called split points or cut points. These points are used to split the continuous range into discrete intervals, and the procedure is recursively applied to the resulting intervals. Bottom-up methods, on the other hand, consider all of the continuous values as potential split points, remove some of them by merging neighboring values to form intervals, and apply this procedure recursively until "stable" intervals are obtained.

A concept hierarchy can then be applied to the discretized values; a concept hierarchy assigns a higher level concept to each discrete value. This approach leads to some loss of information, but the generalized data is more meaningful and easier to interpret.

Manual Methods:

Here the prior knowledge of the feature is used to determine

1. The cut-off points
2. The representatives of the intervals

For example, grades can be used to split continuous marks (Refer Table 2.4).

Table 2.4 : Sample Discretization Measure


Without prior knowledge, it is difficult to discretize. Discretization facilitates the reduction of computational complexity, and many data mining algorithms work only with discretized values. However, for large databases and for applications where there is no prior knowledge, these manual methods are not feasible.

Binning

Binning does not use class information, so it is an unsupervised technique. Here the concept of equal-width or equal-frequency binning is used, and the bin values are replaced by the bin mean or bin median. These techniques can be applied recursively to partition the bins, leading to a concept hierarchy (Refer Figure 2.11). A small sketch of equal-width binning is given below.

Figure 2.11: Binning Process
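The following is a minimal, illustrative sketch (not taken from the text) of equal-width binning with smoothing by bin means; the marks list is hypothetical.

def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        index = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
        bins[index].append(v)
    return bins

def smooth_by_means(bins):
    # replace every value in a bin with the bin mean
    return [[sum(b) / len(b)] * len(b) for b in bins if b]

marks = [15, 20, 25, 60, 70, 75, 90, 95, 100]
print(smooth_by_means(equal_width_bins(marks, 3)))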

Entropy based discretization

This method is a supervised, top-down splitting technique. It uses class information to determine the split points. The criterion used for selecting a split point is minimum entropy.

Each value of A can be considered as a potential split point. Once a split point is decided, it partitions the tuples into those satisfying A <= split point and those satisfying A > split point. This process is called binary discretization.

The method of splitting based on entropy is given below

Let us assume that there are two classes C1 and C2. Ideally, the tuples of C1 would form one partition and the tuples of C2 the other. But this is unlikely, as each partition may contain tuples from both classes. So the split point is chosen based on the expected information requirement.

For the above case it is given by

InfoA(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)


D1 and D2 are the sets of tuples that satisfy the conditions A <= split point and A > split point respectively. The entropy can be calculated as

Entropy(D) = - Σ i pi log2(pi)

where pi is the probability of class Ci in D, determined by the number of tuples of that class divided by the total number of tuples.

The split point chosen is the one that minimizes InfoA(D).

This process is recursively applied to each partition until the information requirement is less than a small threshold ε, or until the number of intervals exceeds a threshold α. A sketch of selecting a single split point in this way is shown below.
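The following Python sketch (illustrative only) selects a single binary split point by minimizing InfoA(D) as described above; the attribute values and class labels used are hypothetical.

import math

def entropy(labels):
    total = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2      # midpoint candidate
        left = [c for v, c in pairs if v <= split]
        right = [c for v, c in pairs if v > split]
        info = (len(left) / len(pairs)) * entropy(left) + \
               (len(right) / len(pairs)) * entropy(right)
        if info < best_info:
            best_point, best_info = split, info
    return best_point, best_info

# Hypothetical attribute values and their class labels
print(best_split([1, 2, 3, 8, 9, 10], ["C1", "C1", "C1", "C2", "C2", "C2"]))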

Interval merging with Chi-square analysis

This algorithm analyzes the quality of multiple intervals for a given feature using the chi-square method. It determines the similarity of the data in adjacent intervals: if the intervals are independent, i.e., if the difference between them is statistically significant, then the intervals are not merged.

The basic algorithm of the chi-merge is given as below

Sort the data of the given feature in ascending order
Define the initial intervals so that every value of the feature is in a separate interval
Repeatedly merge the adjacent intervals with the lowest chi-square value, until the chi-square value of every pair of adjacent intervals exceeds the threshold ε

Cluster analysis

Cluster analysis can be applied to group the values of A into clusters. Then the clustering algorithm can generate a concept hierarchy by following either a top-down approach or a bottom-up approach.

Intuitive partitioning

This algorithm is based on a rule called the 3-4-5 rule. It is used to partition numeric ranges into intervals that are suitable for concept hierarchy generation.

If the interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then the range is partitioned into three intervals (three equal-width intervals for 3, 6 and 9, and intervals grouped 2-3-2 for 7).

If the most significant digit covers 2, 4, or 8 distinct values, then the range is partitioned into four equal-width intervals.

If the most significant digit covers 1, 5, or 10 distinct values, then the range is partitioned into five equal-width intervals.


This rule is recursively applied to each interval, creating the concept hierarchy. The algorithm takes the 5th percentile as low and the 95th percentile as high. The values of low and high are rounded down or up based on the most significant digit. Then the interval ranges are checked for the number of distinct values and the 3-4-5 rule is applied recursively.

Concept hierarchy for categorical data

Categorical data take values from a finite set of distinct values; location and job category are examples of categorical data. A concept hierarchy can be generated by the following techniques.

Specification of the ordering explicitly at the schema level by domain experts. Say, the revenue hierarchy can be specified as Taluk -> District -> State -> Region -> Country.

Specification of a portion of the hierarchy by explicit data groupings. For a large dataset, it is often difficult to define the complete concept hierarchy, so the hierarchy can be specified explicitly for a small portion of the intermediate-level data.

Specification of a set of attributes, but not their partial ordering

Here the hierarchy level of each attribute is decided based on the number of its distinct values. For example, the country level has fewer distinct values than the street level; that is, the instances of country are few compared to street.

In the absence of any user specification, an entire hierarchy may be derived from the database schema.

2.10 MINING ASSOCIATION RULES IN LARGE DATABASES

The task of association rule mining is to find interesting association rules or relationships that exist among the set of items present in a transactional database or databases. A transactional database has records consisting of a transaction ID and a set of items.

The task of discovering association rules was first introduced in 1993 and is usually called "market basket analysis". Shop owners always speculate about customer buying behaviour, and business organizations are interested in facts like groups of customers, buying behaviour, etc. For example, a customer who buys milk is likely to buy bread also, and similar associations between diapers and beer have been noted. This helps supermarkets to organize the shop layout so that such items can be kept together to increase sales.

This analysis is quite valuable for many cross-marketing applications. It also helps shops to organize catalogues, design store layouts and perform customer segmentation. It is useful in other areas too, say in medical diagnosis, where crucial associations between attributes, like the effect of drugs on a patient's cure, can be discovered.


If we mine the transactional database, we get rules of the form A -> B, indicating that B occurs as a consequent of every occurrence of A. Patterns that occur frequently are called frequent patterns. A set of items that are frequently present together is called a frequent itemset. A subsequence, say that a user frequently buys item1 after item2, is called a sequential pattern. An underlying structure that is often present in the transactional database in the form of subgraphs or subtrees is called a structural pattern. ARM (association rule mining) thus uncovers the hidden associations, correlations and other important relationships that are present in the data.

The association relationships are expressed in the form of IF-THEN rules. Each rule is associated with two measurements – support and confidence. Confidence is a measure of the strength of the rule, and support indicates its statistical significance.

Major problems of association mining are

Mining a very large database is a computationally intensive process
The discovered association rules may be spurious

Binary representations

Market basket data is stored in a binary form where each row corresponds to a transaction and each column corresponds to an item. The item is treated as a binary value: its value is one if the item is present and zero if the item is absent. This is a simplistic view of the market basket data.

Formally the problem can be stated as below.

Let I = {I1, I2, …, In} be a set of literals called items. Let D be a database which is a collection of transactions T = {t1, t2, …, tn}, where each transaction ti is a set of items such that ti ⊆ I. Associated with each transaction is a transaction ID.

An association rule is of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. X is called the antecedent and Y is called the consequent.

Every association rule is associated with user-specified parameters, support and confidence. Support indicates the frequency of the pattern, and the strength of the rule is indicated by confidence.

Support of an association rule X → Y is the percentage of transactions that contain X ∪ Y in the database.

The confidence of an association rule is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X. This is also known as the strength of the rule.
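As an illustration (not from the text), support and confidence can be computed over a list of transactions represented as sets; the transactions below are hypothetical.

def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # confidence of the rule x -> y
    return support(x | y, transactions) / support(x, transactions)

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "beer"}, {"milk", "diapers", "beer"}]
print(support({"milk", "bread"}, transactions))        # support of {milk, bread}
print(confidence({"milk"}, {"bread"}, transactions))   # confidence of milk -> bread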

The common approach to the association rule mining problem consists of two phases

1. Find large Itemsets


2. Generate the rules from the frequent itemsets selected in phase one.

But the main problem here is the combinatorial explosion of candidates. For example, a set of five items yields 2^5 = 32 possible itemsets. In general, if there are d unique items, the number of possible rules is R = 3^d − 2^(d+1) + 1.

The complexity is reduced by reducing the number of candidate itemsets and by reducing the number of comparisons. If X ⊂ Y (a proper subset), then every item of X is contained in Y, but there is at least one item of Y that is not present in X.

An itemset X is called closed in a dataset D if there is no proper superset Y with the same support count as X in D. An itemset X is a closed frequent itemset in D if X is both closed and frequent in D. An itemset X is a maximal frequent itemset if there is no proper superset Y of X that is frequent in D. In other words, a set is closed if its support is different from the support of every proper superset.

The frequent itemset algorithms can be classified based on many criteria, as shown below.

2.10.1 Apriori Algorithm

Phase one: Enumerate all itemsets in the transaction database whose support is greater than a minimum support. These itemsets are called frequent itemsets, and the minimum support is a user-defined percentage.

Phase two: The algorithm generates, from the frequent itemsets, all rules that satisfy the minimum confidence threshold specified by the user.

The Apriori principle states that if an itemset is frequent, then all of its subsets are also frequent.

The strategy of pruning the exponential search space based on support is called support-based pruning. The support of an itemset never exceeds the support of its subsets; this is called the anti-monotone property of the support measure.

K = 1; C1 = { {i} | i ∈ I }

While Ck ≠ ∅ do begin

    Count the support of each candidate in Ck over D

    Fk = { S ∈ Ck | support(S) ≥ minimum support }

    Ck+1 = Candidate generation (Fk)

    K = K + 1

End


F = F1 ∪ F2 ∪ F3 ∪ … ∪ Fk-1

Ck = { candidate (possibly frequent) itemsets of size k }

where D is the database and Fk is the set of itemsets of size k that reach the minimum support.
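A compact, illustrative Python sketch of the frequent-itemset phase (phase one) is given below; it follows the candidate-generation-and-pruning idea described above, and the transactions shown are hypothetical.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n

    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_support}      # F1
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # candidate generation: join the previous level with itself, keep size-k sets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune candidates that contain an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

transactions = [frozenset(t) for t in
                [{"A", "B", "E"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
                 {"B", "E"}, {"A", "C"}]]
print(apriori_frequent_itemsets(transactions, 2 / 5))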

Rules generation Phase

Once the frequent itemsets are generated, they are used to generate rules. The objective is to create, for every frequent itemset Z and each of its subsets X, a rule X → Z − X and include it in the result if support(Z) / support(X) ≥ minimum confidence.

R = { }   // resulting rules
For all Z ∈ F do

    C1 = { {i} | i ∈ Z }   // candidate 1-item subsets of Z
    K = 1
    While Ck ≠ ∅ do begin

        Fk = { X ∈ Ck | support(Z) / support(X) ≥ minimum confidence }
        R = R ∪ { X → Z − X | X ∈ Fk }

        Ck+1 = Candidate generation (Fk)
        K = K + 1

    End
End

Consider a transaction table (Refer Table 2.5)

Table 2.5: Transaction Item Table


C1 = { {A}, {B}, {C}, {D}, {E} }

Support values = 3/5, 4/5, 4/5, 1/5, 4/5 respectively

If the minimum support is 2/5, then the resultant is

F1 = { {A}, {B}, {C}, {E} }. The possible two-item sets are listed in Table 2.6.

Table 2.6: Two Items set.

Again, based on the minimum support, the surviving itemsets are tabulated in Table 2.7.

Table 2.7: Possible Three Item set

{A,B,E} is rejected because its support is only 1/5. {A,B,C,E} can be frequent only if all its subsets are frequent, but {A,B,E} is not frequent. So the iteration stops.

Computational complexity of the algorithm can be affected by

Support threshold
Number of items
Number of transactions
Average transaction width

2.10.2 Modifications of Apriori methods

The FP-Tree algorithm is a modification of the Apriori algorithm. The main idea of the FP-Tree algorithm is to store all the transactions in a trie data structure. In this way every transaction is stored, and for each item a linked list links all transactions in which the item is present. Hence a prefix that is shared by many transactions is stored only once.


This advantage is called the single prefix path: even if the prefix is removed, all subsets of that prefix can still be found and added.

The algorithm for constructing the FP-Tree is given below

1. Create a root labeled null.
2. Scan the database. Process the items in each transaction in order; starting from the root, add nodes for the items of the transaction, and link the nodes representing the same item along the different branches (Refer Figure 2.12).

Figure 2.12: FP-Tree

The algorithm can be written as follows

Step 1: Form the conditional pattern base
Step 2: Form the conditional FP-Tree
Step 3: Recursively mine the conditional FP-Tree

The above procedure is carried out like this. Start from the last item in the header table. Find all the paths containing the item by traversing all its links; the patterns in those paths with the required frequency are called conditional patterns. Based on these, the conditional FP-Tree is constructed. Appending the item to the patterns mined from this conditional tree generates the frequent patterns. Thus the FP-Tree can be mined recursively, and the processed items are then removed from the table and the tree.

Compact representation of frequent Itemsets

1. The FP-Tree is a compressed representation of the input data.
2. Each transaction is read and mapped onto a path in the FP-Tree.
3. The more the paths overlap, the more compressed the representation becomes.



If the FP-Tree can be fitted in main memory, then the frequent itemsets can be derived from the structure itself rather than by scanning the database.

The association analysis can be extended to

1. Correlation analysis
2. Causality analysis
3. Max-patterns and frequent closed itemsets
4. Constraint-based mining
5. Sequential patterns
6. Periodic patterns
7. Structural patterns


For example, a rule such as (Age = X1 ∧ Education = X2) → Buys(X, Y) involves several attributes (dimensions).


2.10.3 Evaluation Measures

The measures to evaluate the association rules can be classified into two types

1. Objective measures
2. Subjective measures

Subjective measures are difficult to incorporate into an algorithm. Normally, visualization, template-based approaches and subjective interestingness measures based on concept hierarchies are used.

Objective measures use mathematical formulae to quantify interestingness; support and confidence are such measures. But the limitation of this approach is that a support threshold may eliminate many interesting low-support patterns and relationships, and confidence ignores the support of the itemset in the rule consequent.

A contingency table can be used for finding interestingness. A sample contingency table is shown here (Refer Table 2.8).

Table 2.8: Contingency Table


From the contingency table, the correlation between two items A and B can be obtained as

φ = (f11 f00 − f10 f01) / sqrt(f1+ f+1 f0+ f+0)

This is called the correlation (phi) coefficient. It is -1 for perfect negative correlation, +1 for perfect positive correlation and 0 if the items are independent.

For binary variables

Interestingness(A, B) = S(A, B) / sqrt(S(A) × S(B)) = A · B / (|A| × |B|) = cosine(A, B)

This is also the geometric mean of the two rule confidences:

Interestingness(A, B) = sqrt( (S(A, B) / S(A)) × (S(A, B) / S(B)) ) = sqrt( c(A → B) × c(B → A) )

Interest Factor

Lift = c(A → B) / S(B)

     = rule confidence / support of the itemset in the rule consequent

For binary variables, this is called interest factor

I(A, B) = S(A, B) / (S(A) × S(B)) = N × f11 / (f1+ × f+1)

I(A, B) = 1 if A and B are independent
        > 1 if A and B are positively correlated
        < 1 if A and B are negatively correlated

The advantage of objective measures is that they have a strong mathematical foundation and hence are suitable for data mining. The disadvantage is that the value can be very large even for uncorrelated or negatively correlated patterns.
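For illustration (not from the text), the lift of a rule A → B can be computed from supports as follows; the transactions are hypothetical.

def lift(a, b, transactions):
    support = lambda s: sum(1 for t in transactions if s <= t) / len(transactions)
    return support(a | b) / (support(a) * support(b))

transactions = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}]
# values greater than 1 indicate positive correlation, smaller than 1 negative
print(lift({"milk"}, {"bread"}, transactions))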

Summary

Data quality is crucial for the quality of the results of data mining algorithms.
The types of dirty data are incomplete data, inaccurate data, duplicate data and outliers.
Data preprocessing involves four crucial stages – data cleaning, data integration, data transformation and data reduction.
The measures of central tendency include mean, median, mode and midrange. The most commonly used central tendencies are mean, median and mode.



The most common dispersion measures are range, 5-number summary, interquartile range and standard deviation.
Visualization is an important aspect of data mining which helps the user to recognize and interpret the results quickly.
Noise is a random error or variance in a measured value and can be removed by binning methods, clustering and regression models.
Redundant attributes can be removed by observing the correlation coefficient.
The chi-square technique is a non-parametric test used to test the null hypothesis.
Some of the important data transformations are smoothing, aggregation, generalization, normalization and attribute construction.
Data reduction techniques reduce the dataset while maintaining the integrity of the original dataset.
Continuous attributes can be converted to discrete attributes using discretization.
Association rule mining finds interesting association rules or relationships that exist among the set of items present in the transactional database or databases.
Support of an association rule X -> Y is the percentage of transactions that contain X ∪ Y in the database. The confidence of an association rule is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
The main idea of the FP-Tree algorithm is to use a trie data structure to reduce repeated database scans.
Association rules can be evaluated using objective measures like interestingness and interest factor, or by subjective measures like utility.

DID YOU KNOW?

1. What are the data types?
2. What is the difference between noise and outlier?
3. What is the difference between variance and covariance?
4. What is the relationship between mining and database scan?

Short Questions

1. What are the types of dirty data?
2. What are the stages of data management?
3. What are the types of data preprocessing?
4. What is the need for data preprocessing?
5. What is the need for data summarization?
6. What are the measures of central tendencies?
7. Why are central tendency and dispersion measures important for data miners?
8. What are the measures of skewness and kurtosis?
9. How is the interquartile range useful in eliminating outliers?
10. List the visualization aids available for exploratory data analysis.
11. What is the use of the correlation coefficient for data mining?
12. What is meant by "curse of dimensionality"?
13. What are the methods available for data reduction?


14. What is meant by numerosity reduction?
15. What is meant by sampling? What are the types of sampling?
16. What is meant by discretization?
17. What is meant by concept hierarchy?
18. What is meant by support and confidence in association rule mining?
19. How can frequent itemset algorithms be classified?
20. State the Apriori principle.
21. What are the factors affecting the computational complexity of the Apriori algorithm?
22. How does one evaluate association rules?
23. What is the difference between objective and subjective measures?
24. What is meant by "interestingness"?
25. What is meant by the lift factor?

Long Questions

1. Explain in detail the various stages of data management.
2. Explain in detail the different categories of data preprocessing.
3. For a given set S = {5, 10, 15, 20, 25, 30} of marks, find the mean, median, mode, standard deviation and variance.

4. Consider the above set. Can we say the number 90 is an outlier or not? How does one prove that?

5. Consider an attribute S = {15, 20, 25, 60, 70, 75, 90, 95, 100}. Apply binning techniques and remove the noise present in the data.

6. Consider two attributes whose values are S1 = {5, 10, 20, 40} and S2 = {1, 4, 6, 7}. It is decided to merge the attributes. Can this be justified using the correlation coefficient?

7. Consider the following contingency Table 2.9 of test taking behavior

Table 2.9: Test Taking Behaviour.

Is there any relationship between Gender and test taking behavior?

8. Explain in detail the principal component analysis method for data reduction.
9. Explain in detail the discretization methods.
10. Explain in detail the Apriori algorithm.
11. Consider the following transaction database (Table 2.10).


Table 2.10: Sample Transaction table

Assume the minimum confidence and support levels are 50% and generate the rules using the Apriori algorithm.

12. Explain in detail the FP-Tree algorithm.


UNIT III

PREDICTIVE MODELING

INTRODUCTION

Classification is a supervised learning method. It attempts to discover the relationship between the input attributes and the target attribute. Clustering is a method of grouping the data to form clusters. This chapter focuses on the issues of designing, implementing and analyzing the classification and clustering processes.

Learning Objectives

To explore the types of classifiers
To explore classifier types like decision trees, Bayesian classifiers and other classifier types
To study the evaluation of classifiers
To explore the clustering algorithms
To analyze and perform evaluation of clustering algorithms

3.1 CLASSIFICATION AND PREDICTION MODEL

Classification is a supervised learning method. The input attributes of the classification algorithms are called independent variables, and the target attribute is called the dependent variable. The relationship between the input and target variables is represented in the form of a structure which is called a classification model. The classification problem can be stated formally as follows.

Formal Definition

Given a database D which consists of tuples t1, t2, …, tn, where each tuple has input attributes a1, a2, …, an and a nominal target attribute Y, drawn from an unknown fixed distribution over a labeled space. The classification problem is to define a mapping function which maps D to C (the set of classes) such that each ti is assigned to a class with minimum generalization error.


The classes must be

Predefined
Non-overlapping
Able to partition the entire database

For example, an email classifier may classify incoming mails into valid mail and invalid mail (spam). It can classify an incoming email as spam if

It comes from an unknown address
It possibly involves a marketing source from which the user does not want to receive mail
The mail possibly carries viruses/Trojans that the user does not want to receive
It possibly involves contents that are not suitable for the user

An email classification program may classify mails as valid or spam based on these attributes.

A classification model is both descriptive and predictive. The model can explain the given dataset, and as a predictive model it can also be used to classify unknown tuples.

Classification models can be constructed using regression models. Regression models map the input space into a real-valued domain; for example, regression models can be used to predict not only class labels but also other, real-valued variables. Estimation and prediction are viewed as types of classification. For example, guessing the grade of a student is a classification problem. Prediction is the task of assigning an instance to one of a set of possible outcomes; while prediction may involve continuous variables, classification is generally restricted to discrete variables.

The classification model is implemented in two steps.

1. Training phase
2. Testing phase

In phase 1, a suitable model is created with a large set of training data. The training data has all possible combinations of data. Based on the data, a data-driven classification model is created. Once a model is developed, the model is applied to classify the tuples from the target database.

3.2 ISSUES REGARDING CLASSIFICATION AND PREDICTION

The major issues associated with the implementation of this model are


1. Over fitting of the model

The quality of the model depends on the amount of good quality data. However, if the model fits the training data too exactly, it may not be applicable to a broader population of data. For example, a classifier should not be tuned to the training data more than necessary; otherwise it leads to generalization error.

2. Missing data

The missing values may cause problems during both training and testing phases. Missing data forces classifiers to produce inaccurate results. This is a perennial problem for classification models. Hence suitable strategies should be adopted to handle missing data.

3. Performance

The performance of the classifier is determined by evaluating its accuracy. The process of classification is often a fuzzy issue; for example, classification of emails requires extensive domain knowledge and domain experts. Hence the performance of the classifier is very crucial.

3.3 TYPES OF CLASSIFIERS

There are different methodologies for constructing classifiers. Some of the methods are given below.

Statistical methods

These methods use statistical information for classification. Typical examples are Bayesian classifiers and linear regression methods.

Distance based methods use distance measures or similarity measures for classification.

Decision tree methods are another popular category for classification, using decision trees.

Rule based methods use IF-THEN rules for classification.

Soft computing techniques like neural networks, genetic algorithms and rough set theory are also suitable for classification purposes.

In this chapter, the statistical methods and decision tree methods are explored in detail, and an overview of the other methods is briefly given.

3.3.1 Decision Tree Induction

A decision tree is a classifier that recursively partitions the tuples. For example, for determining whether a mail is valid or spam, a set of questions is repeatedly asked, like:


Does it come from a legitimate source?

Does it contain any illegal program content?

Each question results in a partition. For example, Figure 3.1 shows a split.

Figure 3.1 Example of a Split

The set of possible questions and answers is organized in the form of a decision tree (DT), which is a hierarchical structure of nodes and edges. Figure 3.2 shows the structure of a decision tree. It can be noted that the tree has three types of nodes:

Figure 3.2 Sample Decision Tree

Root

The root node is a special node of the tree. It has no incoming edges. It can have zero or more outgoing edges.

Internal nodes:

These nodes, like the root node, are non-terminal nodes. They contain the attribute test conditions to separate the records.

Terminal nodes

Terminal nodes are also known as leaf nodes. These represent classes and have one incoming edge and no outgoing edges.


The basic questions involved in inducing a DT are

1. How should the records be split?

A suitable attribute test condition, based on some objective measure of quality, should be selected to divide the records into smaller subsets.

2. What is the stopping criterion?

A possible answer is to repeat the procedure until all the records in each leaf belong to the same class. Sometimes the procedure can be terminated earlier to gain some advantage.

Some of the attribute test conditions and their corresponding outcomes for different attribute types are given below.

Binary attributes

The test condition for a binary attribute generates two outcomes. Here, checking the gender is an attribute test. The outcome of a binary attribute test always results in two children. Figure 3.3 shows an example of a binary attribute.

Figure 3.3 Binary Attribute


Nominal attributes

Nominal attributes have many values. A test can be shown as a multi-way split, where the number of outcomes depends on the number of distinct values of the attribute. It can also be shown as binary splits, which can be formed in 2^(k−1) − 1 ways for k distinct values. Some examples of the multi-way split are shown in Figure 3.4 and Figure 3.5.

Figure 3.4 A Multi-way split

Or as a binary split as

Figure 3.5 Multiway Split as Binary Split

Ordinal types

Ordinal attributes produce binary or multi-way splits. Ordinal values can be grouped as long as the basic order is not violated.

For example, consider an attribute with the values {low, medium, high}. The different ways of grouping can be any one of the types shown in Figure 3.6.

Figure 3.6 Ordinal Type


A grouping like {low, high} (which skips medium) is invalid.

Continuous attributes

Continuous attributes can be tested as a comparison with binary outcomes or as multi-way splits. For example, a student's test marks can be expressed as a binary split, as shown in Figures 3.7a and 3.7b.

Figure 3.7a Split of Continuous Attribute.

Figure 3.7b Split of Continuous Attribute.

Splitting criteria

The measures developed for selecting the best split are based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. If the attribute condition splits the records or instances evenly across the classes, then the split is considered to exhibit the highest impurity.

The degree of impurity is measured by measures like entropy, the Gini index and information gain.

1. Information Entropy is measured as

Entropy(t) = - Σ (i=1 to c) pi log2(pi)

Here, c is the number of classes and pi is the fraction of tuples at node t that belong to class i.


For a two-class problem, entropy values range from zero to one. If the value is zero, then all the instances belong to only one class; if the instances are equally distributed among the classes, then the value of entropy is maximal (close to one).

2. The Gini index is expressed as

Gini(t) = 1 - Σ (i=0 to c-1) [p(i | t)]²

3. Information Gain

If a node t is partitioned into r subsets t1, t2, …, tr, then the gain is measured as

Gain(t1, t2, …, tr) = I(t) - Σ (j=1 to r) (|tj| / |t|) × I(tj)

where |t| is the cardinality of t and I(t) can be either entropy or Gini.
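The three measures can be sketched in Python as follows, for illustration only; each function takes a list of class labels at a node, and the example labels reproduce the 4-versus-5 class distribution used in the worked example below.

import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gini(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gain(parent, partitions, impurity=entropy):
    # weighted impurity of the children subtracted from the parent impurity
    total = len(parent)
    weighted = sum(len(part) / total * impurity(part) for part in partitions)
    return impurity(parent) - weighted

labels = ["C1"] * 4 + ["C2"] * 5
print(round(entropy(labels), 4))    # about 0.9911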

3.3.2 ID3 Algorithm

The ID3 algorithm is a top-down approach that recursively develops a tree. It uses the information-theoretic measure information gain to construct a decision tree. At every node the procedure selects the attribute that provides the greatest gain, that is, the greatest decrease in entropy.

The algorithm can be stated as follows

1. Calculate the initial value of the entropy, Entropy(t) = - Σ (i=1 to c) pi log2(pi).

2. Select the attribute which results in the maximum decrease of entropy (maximum gain in information); this serves as the root of the decision tree.

3. The next level of the decision tree is constructed using the same criterion.

4. Steps 2 and 3 are repeated until a decision tree is constructed in which all the instances are assigned to a class, or the entropy of the system is zero.

Example: consider the data set shown in Table 3.1.

Table 3.1 Sample Data Set.


The decision tree can be constructed as

There are four tuples of class C1 and five of class C2. Hence the probabilities of the classes are

P(C1) = 4/9 and P(C2) = 5/9

The entropy of the training sample is given as

Entropy = -4/9 log2 (4/9) – 5/9 log2 (5/9) = 0.9911

The Information gain of the attributes A1 and A2 can be calculated as follows.

For the attribute A1, the corresponding contingency table is shown in Table 3.2.

Table 3.2 Contingency Table of A1

The entropy of the attribute A1 is

= 4/9 [ -(3/4) log (3/4) – ¼ log (1/4)] + 5/9 [ -(1/5) log (1/5) – (4/5) log (4/5)]

= 0.7616

Therefore the information gain for the attribute A1 is

= 0.9911 – 0.7616 = 0.2294

The same procedure can be repeated for the attribute A2

For the attribute A2, the corresponding contingency table is shown in Table 3.3.

Table 3.3 Contingency Table of A2

The entropy of the attribute A2 is

= 5/9 [ -(2/5) log (2/5) – 3/5 log (3/5)] + 4/9 [ -(2/4) log (2/4) – (2/4) log (2/4)]

= 0.9839


Therefore the information gain for the attribute A2 is

= 0.9911 – 0.9839 = 0.0072

Hence the best attribute to split is A1

Gini Index Calculation

The Gini index for the attribute A1 is

= 4/9 [1 – (3/4)² – (1/4)²] + 5/9 [1 – (1/5)² – (4/5)²] = 0.3444

The Gini index for the attribute A2 is

= 5/9 [1 – (2/5)² – (3/5)²] + 4/9 [1 – (2/4)² – (2/4)²] = 0.4889

Since the Gini index for the attribute A1 is smaller, it is chosen first. The resultant decision tree is shown in Figure 3.8.

Figure 3.8 Decision Tree

One of the inherent benefits of a decision tree model is that it can be understood by many users, as it resembles a flow chart. The advantages of the decision tree are its graphical format and the interactivity it provides to the user for exploring the tree.

Decision tree output is often presented as a set of rules of the form IF-THEN. Rules provide a more concise knowledge representation, especially when the tree is too large and understanding it becomes difficult.


Decision trees can be both predictive and descriptive models. They can help us to predict on a case-by-case basis by navigating the tree. More often, prediction is done in an automatic manner, without much interference from the user, by passing multiple new cases through the tree or rule set and generating an output file with the predicted values. In an exploratory mode, the decision tree shows insight into the relationships between the independent and dependent variables, thus helping the data investigation.

Overfitting

A decision tree which performs well on the given training set but fails on test tuples is said to lack the necessary generalization. This is called the overfitting problem. Overfitting is due to the presence of noise in the data, the presence of irrelevant data in the training set, or too few training data. Overfitting reduces the classifier accuracy drastically.

After a decision tree is formed, it must be explored. Sometimes the exploration of the decision tree may reveal nodes or subtrees that are undesirable; this happens because of overfitting. Pruning is a common technique that is used to remove the splits and subtrees created by overfitting. Pruning is controlled by user-defined parameters that cause splits to be pruned; by controlling these parameters the users can experiment with the tree induction to get an optimal tree, ideal for effective prediction.

The methods that are commonly used to avoid overfitting are

1. Pre-pruning: In this strategy, the growth of the tree beyond a certain point is stopped based on measures like chi-square or information gain. The decision tree is then assessed for goodness of fit.

2. Post-pruning: In this technique, the tree is allowed to grow fully. Then post-pruning techniques are applied to remove the unnecessary branches. This procedure leads to a compact and reliable tree. The post-pruning measures involve

cross validation of data

Minimum Description Length (MDL)

computation of statistical bounds.

In practice, the methods of pre-pruning and post-pruning are combined to achieve the desired result. Post-pruning requires more effort than pre-pruning.

Although the pruned trees are more compact than the original, they may still be large and complex. Typically, two problems that are encountered are repetition and replication. Repetition is the repeated testing of an attribute (like: if age < 30, and then if age < 15, and so on). In replication, duplicate subtrees are present in the tree. A good use of multivariate splits based on combined attributes can solve these problems.


Testing a Tree

A decision tree should be tested and validated prior to being integrated with a decision support system. It is also desirable to test the tree periodically over a longer period to ensure that it maintains the desired accuracy, as classifiers are known to be unstable.

3.4 BAYES CLASSIFICATION

A Bayesian classifier is a probabilistic model which estimates the class of new data. Bayes classification is both descriptive and predictive.

Let us assume that the training data has d attributes. The attributes can be numeric or categorical. Let each point xi be the d-dimensional vector of attribute values xi = {xi1, xi2, …, xid}.

The Bayes classifier maps xj to ci, where ci is one of the k classes

c = {c1, c2, ….., ck }

Let p(ci | xj) be the conditional probability of assigning xj to class ci; the class with the highest probability is chosen.

The Bayes theorem can be given as

p(A | B) = p(B | A) × p(A) / p(B)

From the training set, the values that can be determined are p(xi), p(xi | cj) and p(cj). From these, Bayes theorem helps us to determine the posterior probability p(cj | xi).

The procedure is given as

First determine the prior probability p(cj) for each class by counting how often that class appears in the data set.

Similarly, the number of occurrences of xi can be counted to determine p(xi), and p(xi | cj) can be estimated by counting how often the instance xi occurs within class cj.

From these derived probabilities, a new tuple is classified using the Bayesian method of combining the effects of the different attribute values.

Suppose the tuple ti has p independent attribute values {xi1, xi2, …, xip}; then we can estimate p(ti | cj) by

p(ti | cj) = Π (k=1 to p) p(xik | cj)

In other words ti is assigned to class cj


iff p(cj | x) > p(ci | x) for all i ≠ j.

Advantages

1. It is easy to use.
2. Only one scan of the database is needed.
3. Missing values do not cause much of a problem, as they are ignored.
4. In a dataset with simple relationships, this technique produces good results.

Disadvantages

1. The Bayes classifier assumes that attributes are always independent. This assumption does not hold for many real-world datasets.

2. Bayes classification will not work directly for continuous data.

Example

Let us assume the simple data set shown in Table 3.4 and apply the Bayesian classifier to predict the class of X = (1, 1).

Table 3.4 Sample Data set

Here c1 = 2 and c2 = 3

p(c2) = 1/3 ; p(c1) = 1/2. The conditional probabilities are estimated as:

p(a1 = 1 | c1) = 1/2 ; p(a1 = 1 | c2) = 1/2

p(a2 = 1 | c1) = 1/2 ; p(a2 = 1 | c2) = 2/3

p(x | c1) = p(a1 = 1 | c1) × p(a2 = 1 | c1) = 1/2 × 1/2 = 1/4


p(x | c2) = p(a1 = 1 | c2) × p(a2 = 1 | c2) = 1/2 × 2/3 = 1/3

This is used to evaluate

p(c1 / x) = 1/4 * 1/2 = 1/8

p(c2 / x) = 1/3 * 1/3 = 1/9

p(c1 / x)> p(c2 / x)

Hence the sample is predicted to be in class c1.
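The computation used in this example can be sketched in Python as follows (illustrative only); the prior and conditional probability tables are supplied as plain dictionaries and mirror the values used above.

def naive_bayes_predict(x, priors, conditionals):
    """x: dict attribute -> value; priors: dict class -> p(c);
    conditionals: dict (class, attribute, value) -> p(value | class)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= conditionals.get((c, attr, value), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

priors = {"c1": 1 / 2, "c2": 1 / 3}
conditionals = {("c1", "a1", 1): 1 / 2, ("c1", "a2", 1): 1 / 2,
                ("c2", "a1", 1): 1 / 2, ("c2", "a2", 1): 2 / 3}
print(naive_bayes_predict({"a1": 1, "a2": 1}, priors, conditionals))  # predicts c1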

Bayesian Network

A Bayesian network is a graphical model. This model shows the probabilistic relationships among the variables of a system. A Bayesian network consists of vertices and edges: the vertices represent the variables, and the edges represent their interrelationships, with associated probability values. By using the conditional probabilities, we can reason and calculate unknown probabilities. Figure 3.9 shows a simple Bayesian network.

Figure 3.9: Simple Bayesian Network.

Here the nodes or vertices are variables, and an edge connects two nodes. If a causal relationship exists, that is, A is the cause of B, then the edge is directed. Otherwise the relationship is just a correlation, and the edge is undirected. In the above example, there exists a correlation relationship between A and B, and between B and C. There is no direct relationship between A and C, but there may still be some impact since both are connected to B.

In addition, each node has an associated conditional probability table which stores all the probabilities that may be used to reason or make inferences within the system.


The probability here is a Bayesian probability, a subjective measure that indicates the degree of belief in an event. The difference between a physical probability and a Bayesian probability is that for a Bayesian probability repeated trials are not necessary. For example, consider a game whose outcome cannot be determined from earlier trials and whose trials cannot be repeated to make probability calculations; the probability is then a degree of belief.

Apart from the Bayesian probability, the edges represent causal relationships between variables. If manipulating a node X by some action sometimes changes the value of Y, then X is said to be a cause of Y.

The causal relationships also indicate strength. This is done by associating a number with each edge.

The conditional probability for this model would be P(Xi / Pai). Here, Pai is the set of parents of Xi, that is, the nodes that render Xi independent of its other predecessors. Then the joint probability distribution can be given as

P(x1, ....., xn) = product over i of P(Xi / Pai)

Using this factorization and passing evidence up and down the Bayesian network, a process known as belief propagation, one can easily calculate the probability of events.
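As a concrete illustration of the factored joint distribution, the sketch below evaluates P(A, B, C) = P(A) P(B / A) P(C / B) for a small hypothetical chain network A -> B -> C with boolean variables; the conditional probability tables are made-up numbers, not values taken from the text.

# Hypothetical CPTs for the chain A -> B -> C.
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}   # p_B_given_A[a][b]
p_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}   # p_C_given_B[b][c]

def joint(a, b, c):
    # P(A=a, B=b, C=c) as the product of each node's CPT entry given its parents.
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Probability of an event, e.g. P(C = True), by summing the joint over the other variables.
p_c_true = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
print(joint(True, True, True), p_c_true)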

Advantages

1. The bidirectional message-passing architecture is inherent in the Bayesian network. Learning from evidence is unsupervised learning.

2. It can handle incomplete or missing data because the Bayesian network models only the dependencies among the variables.

3. It can combine domain knowledge and data. Encoding of causal prior knowledge is straightforward in a Bayesian network.

4. By using the graphical structure, Bayesian networks ease many of the theoretical and computational difficulties of rule-based systems.

5. It simulates human-like reasoning.

3.5 OTHER CLASSIFICATION METHODS

3.5.1 Rule Based Classification

One straightforward way to perform classification is to generate rules. Rules are of the form IF condition THEN conclusion.

The IF part contains the rule antecedent or precondition. The THEN part is the rule consequent. If a tuple of the database holds true for the rule antecedent, the rule antecedent is satisfied and the rule covers the tuple.



Decision rules are generated using a technique called the "covering algorithm". The algorithm chooses the best attribute to perform classification based on the training data, that is, the attribute that minimizes the error, and generates rules from it. For the remaining attributes, the probabilities are recalculated and the process is repeated.

For each attribute and for each value of that attribute:

Count how often each class appears.

Find the most frequent class.

Form a rule of the form IF attribute = value THEN class.

Calculate the error rate of the rules, that is, the number of instances in the training data that do not have the majority class.

Choose the rules with the smallest error rate.

The quality of the rule is based on the measures – Coverage and accuracy.

Let us assume that the total number of tuples is D. If a rule R covers n_covers tuples, then

Coverage (R) = n_covers / D

Accuracy (R) = n_correct / n_covers

Here n_correct is the number of covered tuples that the rule classifies correctly, as certified by the domain expert. If a rule covers 14 tuples and 7 of them are correct as per the domain expert, then the accuracy is 7/14 = 50%.
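Both measures can be computed with a few lines of code. The sketch below is illustrative only: the rule (IF outlook = sunny THEN play = no), the attribute name and the four tuples are hypothetical.

def rule_quality(rule_condition, predicted_class, dataset):
    # coverage = n_covers / |D|, accuracy = n_correct / n_covers
    n_covers = sum(1 for tuple_, _ in dataset if rule_condition(tuple_))
    n_correct = sum(1 for tuple_, label in dataset
                    if rule_condition(tuple_) and label == predicted_class)
    coverage = n_covers / len(dataset)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# IF outlook = 'sunny' THEN play = 'no'
dataset = [({'outlook': 'sunny'}, 'no'), ({'outlook': 'sunny'}, 'yes'),
           ({'outlook': 'rain'}, 'yes'), ({'outlook': 'overcast'}, 'yes')]
print(rule_quality(lambda t: t['outlook'] == 'sunny', 'no', dataset))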

What is the difference between Decision Tree and Decision rule?

Rules have no order, while a decision tree has some implicit order.

Only one class is examined at a time for decision rules, while all the classes are examined for the decision tree.

3.5.2 K-Nearest Neighbor Technique

The similarity measures can be used to determine “alikeness” of different tuples in thedatabases.

The representative of every class is selected. The classification is performed by assigningeach tuple to the class to which it is more similar.


Let us assume that the classes are {C1, C2, ....., Cm} and the database D has the tuples {t1, t2, ....., tn}. The classification problem is to assign a tuple ti to the class Cj such that Similarity(ti, Cj) is greater than or equal to Similarity(ti, Ci) for every class Ci that is not equal to Cj.

The algorithm can be stated as

1. Choose the representative of each class. Normally the center or centroid of the class is chosen as the representative of the class.
2. Compare the test tuple with the center of each class.
3. Classify the tuple to the appropriate class.

One of the commonest schemes is the K-nearest neighbor technique, where K is the number of nearest neighbors. This technique assumes that the entire training data, along with the desired classification of each item, is available; thus the input data itself serves as the model.

When a classification is required, only the K nearest neighbors are considered. The test tuple is placed in the class that contains the most items from the set of its K closest items.

The KNN technique is therefore extremely sensitive to the value of K. Normally K is chosen such that K <= sqrt(number of training items). The default value is normally 10.
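A minimal sketch of this procedure for numeric attributes is shown below; it uses the Euclidean distance and majority voting among the K closest items. The sample points and the choice K = 3 are illustrative only.

import math
from collections import Counter

def knn_classify(query, training, k=3):
    # Assign the query to the class that is most frequent among its k closest items.
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), 'A'), ((1, 2), 'A'), ((5, 5), 'B'), ((6, 5), 'B'), ((2, 1), 'A')]
print(knn_classify((1.5, 1.5), training, k=3))   # expected: 'A'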

3.5.3 Neural Networks

Neural networks are also used to classify the data. It is a valuable tool for classification.

The classification problem using the neural network involves these steps

1. Determine the input nodes, output nodes and hidden nodes. This is determined by the domain expert and the problem.

2. Determine the initial weights and the activation functions.

3. For each tuple, propagate it through the network of neurons and obtain the result. Evaluate the result and compare it with the actual result.

If the prediction is accurate, adjust the weights so that this prediction is reinforced the next time.

If the prediction is not accurate, then adjust the weights to reduce the error.

4. Continue this process till the network makes accurate classifications.


Hence the important parameters of the neural network are

Number of nodes
Number of hidden nodes
Weighting functions
Learning technique for adjusting weights

The advantages and the disadvantages of neural networks are tabulated in Table 3.5.

Table 3.5 Pros and Cons of Neural Network based Classification.

Advantages:
1. More robust in handling noise
2. Improves performance by learning
3. Low error rate
4. High accuracy

Disadvantages:
1. Difficult to understand the rule generation process
2. Input values should be numeric
3. Generating rules is not a straightforward process

3.5.4 Support vector Machine

Support vector machine is a popular classification method for classifying both linearand nonlinear data. A support vector machine uses a nonlinear mapping to transform trainingdata into a higher dimension. Then it searches for a linear separating hyperplane (or adecision boundary) to separate the data. It searches for the hyperplane using supportvectors (essential training tuples and margins).

3.5.5 Genetic Algorithms

Genetic algorithms model natural evolution. An initial population is created consisting of randomly generated rules. A rule such as IF A AND NOT B THEN C can be encoded as the bit string 101. The group of strings forms a population.

Based on the fitness function, the best rules of the current population are selected. The genetic algorithm deploys genetic operators like crossover and mutation to form new pairs of rules. Crossover swaps substrings to form a new pair of rules. Mutation simply inverts bits of the rule string. This results in a new population. This process of generating new populations is continued till a population satisfies a predefined fitness threshold.

3.5.6 Rough Set

Rough set theory is based on equivalence classes. It tries to establish equivalenceclasses within given training data. Rough set tries to approximate the classes that cannot be


distinguished in terms of the available attributes. Such a class is approximated by two sets – a lower and an upper approximation.

The lower approximation consists of tuples that certainly belong to class C, while the upper approximation consists of tuples that cannot be described as not belonging to class C, that is, tuples that possibly belong to it. Decision rules can then be generated for each class.

3.5.7 Fuzzy approach

Traditional systems are based on binary logic. But many terms like low, medium and high are fuzzy; depending on the context, these terms take different values. Fuzzy logic allows fuzzy thresholds or boundaries to be defined for each category. Unlike in a traditional set, here an item can belong to more than one class.

It is useful in dealing with vague or imprecise measurements and can be used by data mining systems to classify tuples. Sometimes more than one fuzzy rule is applicable in a given context. Each rule contributes a vote for membership in the categories. The votes are summed and the sums are combined. Finally the fuzzy output is defuzzified to get a crisp output.

3.6 PREDICTION MODELS

3.6.1 Linear Regression

Numeric prediction of a continuous variable for a given input is called a prediction problem. One of the commonest methods of prediction is linear regression.

Regression analysis is used to model the relationship between one or more independent variables and a dependent variable. The data can be displayed in a scatter plot to get a feel of the data. The X-axis of the scatter plot represents the independent (input or predictor) variable and the Y-axis represents the dependent (output or predicted) variable.

The scattered data could be fitted by a straight line. In the simplest form, the modelcan be created as

Y = W0 + W1 * X

Here W0 and W1 are weights and are called regression coefficients. This specifies theY- Intercept and slope of the line. The coefficients can be solved for by the method of leastsquares. This estimates the best fitting line (Because many lines are possible!) that minimizesthe error between the actual data and the estimate.

If D is the training set consisting of the (x, y) values, then

W1 = Σi (xi − X̄)(yi − Ȳ) / Σi (xi − X̄)²

W0 = Ȳ − W1 X̄


where X̄ and Ȳ are the mean values of the X and Y data. The coefficients provide approximations.

Let us consider an example (Refer Table 3.6)

Table 3.6 Sample Data for Regression

The data can be shown as a scatter plot in Figure 3.10.

Figure 3.10 Scatter Plot

Let us model the relationship as

Y = W0 + W1 x

Here X̄ = 34/4 and Ȳ = 324/4.

Substituting these in the equations for W0 and W1, we get the fitted line

Y = 0.925 x + 0.194
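The same least-squares formulas can be evaluated directly in code, as in the sketch below. The (x, y) points used there are made-up illustrative values, not the data of Table 3.6.

def fit_line(xs, ys):
    # Least-squares estimates of the slope W1 and intercept W0.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

xs = [1, 2, 3, 4, 5]          # hypothetical predictor values
ys = [1.2, 1.9, 3.1, 3.9, 5.2]  # hypothetical responses
w0, w1 = fit_line(xs, ys)
print(f"Y = {w1:.3f} * X + {w0:.3f}")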

Multiple Linear Regressions

This is an extension of linear regression problem. If there are n predictor variables orattributes describing a tuple, then the output is a combination of predictor variables.


Hypothetically, if the attributes are A1, A2, and A3, then the linear regression model wouldbe Y = w0 + w1 x1 + w2 x2 + w3 x3.

Using the same methods used above, the equations can be solved to get the outputvariable value.

3.6.2 Non linear regression

In real-world data there may not exist any linear dependence in the data. In such cases, one can apply transformations to convert nonlinear models to linear models.

For example, consider the cubic polynomial y = w0 + w1 x + w2 x^2 + w3 x^3. It can be converted to the linear form y = w0 + w1 x1 + w2 x2 + w3 x3, where x1 = x, x2 = x^2 and x3 = x^3.

Other regression-based methods are logistic regression models, where the probability of an event occurring is a linear function of a set of predictor variables, and log-linear models, which approximate multidimensional probability distributions.

3.7 TESTING AND EVALUATION OF CLASSIFIER AND PREDICTION MODELS

The ability of a classification model to correctly determine the class of randomly selected data is known as accuracy. It can be seen as the probability of correctly classifying test data. Estimating classification accuracy is a difficult task, as different training sets often lead to somewhat different models; therefore testing is very important.

Accuracy can be characterized using several metrics like sensitivity, specificity, precision and accuracy. The methods for estimating errors include hold-out, random sub-sampling, K-fold cross validation and leave-one-out.

Let us assume that the test data has a total of T objects. If C objects are classified correctly, then the error rate is (T – C) / T.

A confusion matrix is often used to represent the results of the tests.

Cross validation is a model evaluation method that gives an indication of how well thelearner will perform new predictions for test data. One way to accomplish cross validationis to remove some data during the training phase. Then the data that was removed duringthe training phase is used to test the performance of the classifier on the unseen data. Thisis the basic idea of cross validation.

The holdout method is the simplest method of cross validation. Here the data set isseparated into two sets. One set is called the training set and another set is called testingset. The function approximator fits a function using the training set. Then this function is


asked to predict the unseen values of the testing set. The errors it makes are accumulated to give the mean absolute test set error. This error is used to evaluate the model.

The advantage of this method is that it is easy to compute, but the evaluation can have a high variance. Also, the result depends on how the division into training and test sets is made.

K-fold cross validation is an improvement over the previous method. Here the data set is divided into k subsets, and the holdout method is repeated k times. Each time, k-1 subsets are considered as the training set and the remaining one subset is treated as the test set. The process is repeated across the whole data set for k trials. The overall performance of the classifier is the average error across all k trials.

The advantage of this method is that the performance does not depend on how the data is partitioned, as every point serves k-1 times in the training set and exactly once as test data. The variance of the resulting estimate is reduced as the number of trials (k) increases. The value of k is normally set to 10; hence this method is also called 10-fold cross validation. The disadvantage of this method is that it takes k times as much computation to make an evaluation.

Leave-one-out cross validation is K-fold cross validation taken to its logical extreme: K is equal to the total number of points in the data set, that is, N. The function approximator is trained on all the data except for one point in each trial, and a prediction is made for that point. The average error over all trials is then computed and used to evaluate the model.
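The K-fold procedure can be expressed compactly, as in the sketch below. The train and evaluate functions stand in for any classifier and error measure, and the labelled tuples and the trivial majority-class classifier are hypothetical.

def k_fold_cv(data, k, train, evaluate):
    # Split the data into k folds; each fold serves once as the test set.
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(training)
        errors.append(evaluate(model, test))
    return sum(errors) / k          # average error across the k trials

# Example with a trivial majority-class classifier on labelled tuples (x, label).
data = [(x, 'pos' if x > 5 else 'neg') for x in range(10)]
train = lambda rows: max(set(l for _, l in rows), key=[l for _, l in rows].count)
evaluate = lambda model, rows: sum(1 for _, l in rows if l != model) / len(rows)
print(k_fold_cv(data, 5, train, evaluate))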

3.7.1 Evaluation Metrics

There are several metrics that can be used to describe the quality and usefulness of a classifier. Accuracy is one characteristic. Accuracy can be expressed through a matrix called the confusion matrix, which summarizes the classification that is carried out over the data. The entries of the confusion matrix are:

TP = number of positive specimens that are classified correctly (true positives).
FP = number of negative specimens that are classified wrongly as positive; these are false alarms (false positives).
FN = number of positive specimens that are classified wrongly as negative; these are misclassified samples (false negatives).
TN = number of true negative specimens that the classifier is able to diagnose correctly.

The results can be better understood by correlating this with a hypothetical classifier for classifying cancer data. If the classifier result matches the actual result for cancer patients, it is called a true positive. If the results match for the non-cancer patients, it is called a true negative. A false positive is a false alarm: the classifier says that the patient has cancer when in reality he is not a cancer patient. A false negative is one where the classifier reports that a patient is normal when the patient is indeed a cancer patient. A false negative


has a lot of legal and sociological impacts and is a serious error that the classifier should seek to avoid.

Sensitivity

The sensitivity of a test is the probability that it will produce a true positive result when used on positive test data. The sensitivity of a test can be determined by calculating:

TP / (TP + FN)

Specificity

The specificity of a test is the probability that it will produce a true negative result when used on negative test data.

TN / (TN + FP)

Positive Predictive Value

The positive predictive value of a test is the probability that an object is truly positive when a positive test result is observed.

TP / (TP + FP)

Negative Predictive Value

The negative predictive value of a test is the probability that a object is not classifiedproperly when a negative test result is observed.

TN / TN + FN

Precision

The precision is defined as Precision = TP / (TP + FP).

It can be shown that the accuracy of the classifier can be expressed in terms of sensitivity and specificity:

Accuracy = Sensitivity * (TP + FN) / (TP + FP + FN + TN) + Specificity * (TN + FP) / (TP + FP + FN + TN).
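All of these measures follow directly from the four confusion-matrix counts, as the sketch below shows; the counts used in it are invented for illustration, and the last value checks the accuracy identity given above.

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)          # positive predictive value
    npv         = tn / (tn + fn)          # negative predictive value
    accuracy    = (tp + tn) / total
    # Accuracy as sensitivity weighted by actual positives plus specificity weighted by actual negatives.
    accuracy_check = sensitivity * (tp + fn) / total + specificity * (tn + fp) / total
    return sensitivity, specificity, precision, npv, accuracy, accuracy_check

print(metrics(tp=40, fp=10, fn=5, tn=45))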

Like the classifier models, the predictor model accuracy is determined using loss functions. The most common loss functions are

Absolute error = | Yi − Yi' |

Squared error = (Yi − Yi')²

where Yi is the actual value and Yi' is the predicted value. When the error is computed over the entire set, it results in the mean absolute error


Mean absolute error = Σ(i=1 to d) | Yi − Yi' | / d

Mean squared error = Σ(i=1 to d) (Yi − Yi')² / d

The mean squared error exaggerates the presence of outliers. The errors can be normalized by dividing the total loss by the corresponding loss of the mean predictor:

Relative absolute error = Σ(i=1 to d) | Yi − Yi' | / Σ(i=1 to d) | Yi − Ȳ |

Relative squared error = Σ(i=1 to d) (Yi − Yi')² / Σ(i=1 to d) (Yi − Ȳ)²

The root relative squared error can be obtained by taking the square root of the relative squared error.
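The sketch below computes these error measures for a small set of made-up actual and predicted values.

def error_measures(y, y_pred):
    # y holds the actual values Yi, y_pred the predictions Yi'.
    d = len(y)
    y_bar = sum(y) / d
    mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / d
    mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / d
    rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
    rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
    rrse = rse ** 0.5                      # root relative squared error
    return mae, mse, rae, rse, rrse

print(error_measures([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 6.0, 9.5]))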

The other criteria for evaluating the classifier models apart from accuracy factor are

1. Speed

The time to construct the model and the time to use the model are often referred to as speed. This time should be minimized.

2. Robustness

Classifiers can be unstable: poor quality of the data often results in poor classification. The ability of the classifier to provide good results in spite of some errors or missing data is an important criterion for evaluating the classifier.

3. Scalability

The algorithms should be able to handle large data sets. It is required that very large data be handled efficiently.

4. Goodness of fit

The models should fit the problem nicely. For example, a decision tree should have“right” size and compactness to have high accuracy.

3.7.2 Model Selection

ROC curves are useful tool for comparing classification models. ROC is an acronymof Receiver Operating Characteristic. It is a plot of sensitivity and the false positive rate fora given model.

The Y-axis is true positive rate and X-axis false positive rate. We start from the bottomleft-hand corner initially. If we have any true positive case, we move up and plot a point. If


it is a false positive case, we move right and plot. This process is repeated till the completecurve is drawn.

If the ROC curve is closer to the diagonal line then it shows the classifier to be lessaccurate. The area under the curve indicates the accuracy of the model. A model is perfectmodel if it has area under ROC curve as one. A sample ROC curve is shown in Figure3.11.

Figure 3.11 Sample ROC curve.

3.8 CLUSTERING MODELS

Clustering is a technique of partitioning objects that have many attributes into meaningful disjoint subgroups. The objects in a subgroup are similar to each other while differing significantly from the objects in other clusters. Clustering is also called segmentation (or dissection). The objectives of the clustering process are

To discover the nature of the sample
To see whether the data falls into natural classes

Cluster analysis is different from classification. In the classification process, the classes are predefined; the user knows the class types, and in the training data, samples are given along with the class label. Cluster analysis, in contrast, is unsupervised learning, where the classes or clusters are not known to the user. Thus the salient points of clustering versus classification are

The best number of clusters is not known in the clustering process
No prior knowledge is available in clustering
Cluster results are dynamic

A good example of the clustering process is to group flowers into a meaningful group hierarchy. In business, for example, the customer database can be grouped based on many questions like


a. What do customers buy?
b. How do they buy?
c. How much money do they spend?
d. Customer lifestyle
e. Purchasing behavior
f. Demographic characteristics

In medicine, for example, clustering can be used to discover subgroups of diseases.

Problems of clustering

Outlier handling is difficult, as outliers may themselves form a solitary cluster; when forced, clustering algorithms make them part of an existing cluster and hence cause problems.

Dynamic data handling is difficult.

Interpreting the cluster results requires semantic meaning, and evaluating clustering results is a difficult task.

The aim of the clustering process is exploratory. It aims to group data with small within-group variations and large between-group variations. Thus the clustering process often finds some interesting clusters.

The issues of the clustering process are

Desired features of clustering algorithms
Measurement of similarity and dissimilarity
Categorization of clustering algorithms
Clustering of very large data
Evaluation of clustering algorithms

The desirable characteristics of clustering algorithms are

Require no more than one scan of the database
Online ability to report the present status and the 'best answer' so far
Suspendable, stoppable and resumable
Incremental addition or deletion of instances
Work with limited memory
Perform different kinds of scanning of the database
Process each tuple only once

3.8.1 Types of Data and Clustering Measures

Clustering algorithms are based on similarity measures between objects. Similaritymeasures can be obtained directly from the user or can be determined indirectly fromvectors or characteristics describing each object.

Therefore, it becomes necessary to have a clear notion of similarity. Often it is possibleto derive dissimilarity from similarity measure. For example, if s(i,k) denotes the similarity,then the dissimilarity d(i,k) can be obtained from similarity using some transformations.Some transformations are like


sik = 1 / (1 + dik)

dik = sqrt( 2 (1 − sik) ), where 1 >= sik >= 0 and sik = 1 indicates the highest similarity; this transformation has the property of a distance.

The terms similarity and dissimilarity are often denoted together by the term proximity.

The dissimilarity measures are informally referred to by the term distance. Often the term metric is used in this context. A metric is a dissimilarity measure that satisfies the following conditions.

1. d(i, j) >= 0 for all i and j
2. d(i, j) = 0 if i = j
3. d(i, j) = d(j, i) for all i and j
4. d(i, j) <= d(i, k) + d(k, j) for all i, j and k

Condition 4 is called the triangle inequality.

Hence not all distance measures are metrics but all metrics are distances.

Distance measures

The distance measures vary depending on the types of data. The chapter two can bereferred for a detailed explanation. For convenience sake, the data types that are usuallyencountered are given in Table 3.7.

Table 3.7 Sample Data Types.

Based on the data type, the distance measures vary. Some of the distance measuresare shown in the Table 3.8.


Table 3.8 Sample Distance Measures.

Hence the distance measures are used to distinguish one object from other and alsouseful for grouping the objects.

Quantitative variables

The Euclidean distance is one of the most important distance measures. The Euclidean distance between two objects is calculated as

Distance (Oi, Oj) = sqrt( Σk (Oik − Ojk)² )

Suppose the coordinates of the objects O1 and O2 are (5, 6) and (8, 9); then the Euclidean distance can be calculated as

D(O1, O2) = sqrt( (5 − 8)² + (6 − 9)² ) = sqrt( 9 + 9 ) = sqrt(18)

Advantages of the Euclidean distance

Distances do not change with the addition of new objects
However, if the units change, the resulting Euclidean or squared Euclidean distances change drastically


City-block (Manhattan) distance

The Manhattan distance measures the average distance across dimensions and dampens the effect of outliers.

Manhattan Distance (Oi, Oj) = (1/n) Σ(k=1 to n) | Oik − Ojk |

Distance (O1, O2) = 1/2 ( | 5 − 8 | + | 6 − 9 | ) = 1/2 (6) = 3

Chebyshev distance:

Distance (Oi, Oj) = max over k of | Oik − Ojk |

Distance (O1, O2) = max ( | 5 − 8 |, | 6 − 9 | ) = max (3, 3) = 3
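These three quantitative distance measures are easy to compute; the sketch below applies them to the same objects O1 = (5, 6) and O2 = (8, 9) used in the examples above.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan_average(a, b):
    # average city-block distance across dimensions, as defined above
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

o1, o2 = (5, 6), (8, 9)
print(euclidean(o1, o2), manhattan_average(o1, o2), chebyshev(o1, o2))
# sqrt(18) ≈ 4.243, 3.0, 3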

Binary attributes

Binary attributes will have only two values. Distance measures are different if binaryattributes are used.

For example

This can be converted to binary attributes based on thresholds, for example

x1 = 1 if height >= 165 cm, otherwise x1 = 0

x2 = 1 if weight >= 70 kg, otherwise x2 = 0



Then, in the table of matches and mismatches:

Number of attributes where individual 1 = 1 and individual 2 = 1: a = 1
Number of attributes where individual 1 = 1 and individual 2 = 0: b = 2
Number of attributes where individual 1 = 0 and individual 2 = 1: c = 3
Number of attributes where individual 1 = 0 and individual 2 = 0: d = 1

Then the following distance measures can be applied depending on the context. Many of the measures use the counts of matches and mismatches between the 0 and 1 values for the calculation of distance.

1. Equal weights for 1-1 matches and 0-0 matches
2. Double weight for 1-1 and 0-0 matches
3. Double weight for unmatched pairs
4. No 0-0 matches in the numerator
5. 0-0 matches are treated as irrelevant
6. No weight for 0-0 matches but double weight for 1-1 matches
7. No 0-0 matches in the numerator or denominator; double weight for unmatched pairs
8. Ratio of matches to mismatches, with 0-0 matches excluded

Categorical Data

In many cases, we use categorical values. It serves as a label for attribute values. It isjust a code or symbol to represent the values.

For example, for the attribute Gender, a code 1 can be Female and 0 can be male.Similarly, the job description can be 1 for representing the manager, 2 for junior managerand 3 for clerk (Refer Figure 3.12 for hierarchy of categorical data).

To calculate the distance between two objects represented by variables, we need tofind the categories. If the category is just two, techniques like simple matching, Jaccard



method or Hamming distance can be used. If the category is more than two, then thecategories can be transformed to a set of dummy variables that is of binary category.

For example

Figure 3.12 Categorical Data

Then John who is working as manager can be coded as John = <1,<1,0,0>> andPeter who is working as a clerk can be coded as <1,<0,0,1>>.

Then the distance between John and Peter is (0, 2/3) where the distance betweentwo objects is the ratio of number of unmatched and total number of dummy variables.

Each category can also be assigned to many binary dummy variables.

If the number of categories is C, then V variables can be assigned to each category so that the number of variables satisfies the equation

V = Ceiling of ( Log C / Log 2 )

For example, for the category Job Designation = Manager/ Junior Manager/ Clerk,The number of variables that can be assigned to is

V = Ceiling (Log 3/ Log 2)

= Ceiling (1.58)

= 2


Therefore the representation of job designation would be

Manager = [ 1 1]

Junior Manager = [ 1 0]

Clerk = [ 0 1]

So, for the previous example, John would be ( 1, [1,1]) and Peter would be (1,[0,1]). Hence the distance would be the ratio of unmatched and total dummy values.

Therefore the distance between John and Peter is (0, ½)

Another measure that is used for finding the distance between categorical variables is the percent disagreement. It can be defined as

Distance (Oi, Oj) = 100 x [ number of attributes k for which Oik ≠ Ojk ] / [ total number of attributes ]

If Object 1 and Object 2 disagree on 2 out of 4 attributes, then the distance

= 100 x 2 / 4 = 50%
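The dummy-variable distance and the percent disagreement can both be computed in a few lines. In the sketch below the job-designation dummy codes follow the example above, while the two four-attribute objects are hypothetical.

dummy = {'Manager': (1, 0, 0), 'Junior Manager': (0, 1, 0), 'Clerk': (0, 0, 1)}

def dummy_distance(cat_a, cat_b):
    # Ratio of unmatched dummy variables to the total number of dummy variables.
    a, b = dummy[cat_a], dummy[cat_b]
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def percent_disagreement(obj_a, obj_b):
    # 100 * (number of attributes on which the objects differ) / (total attributes).
    return 100 * sum(1 for x, y in zip(obj_a, obj_b) if x != y) / len(obj_a)

print(dummy_distance('Manager', 'Clerk'))                      # 2/3, as in the John/Peter example
print(percent_disagreement(('M', 'Manager', 'High', 'Yes'),
                           ('F', 'Clerk',   'High', 'Yes')))   # 50.0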

Distance measure for Ordinal variables

Ordinal variables are like categorical variables, but they have an inherent order. For example, if the job designation codes are 1, 2 and 3, then code 1 is higher than code 2 and code 2 is higher than code 3; the data is ranked as 1 >> 2 >> 3.

To compute similarity or dissimilarity, many distance measures like the Chebyshev (maximum) distance or the Minkowski distance can be used. There are also some specialized distance measures like the Kendall distance and the Spearman distance.

For ranked data, the distance can be seen as the spatial disorder between two vectors. One vector is taken as the pattern vector (it carries the reference order or sequence and serves as a guide) and the other as the disorder vector. The distance is the number of operations it takes to turn the disorder vector into the pattern vector.

For example, consider the preferences of three persons:

Person 1 = < Coffee, Tea, Milk >
Person 2 = < Tea, Coffee, Milk >
Person 3 = < Coffee, Tea, Milk >

Then the distance between person 1 and person 3 is zero, as all the preferences are the same. For finding the distance between person 1 and person 2, either person 1 or person 2 can be taken as the pattern vector.


If person 1 is considered as the pattern vector, then person 2 becomes the disorder vector. The distance between the two is then calculated as the number of positions that differ, or using a traditional distance measure like the Chebyshev distance.

The choice of the pattern or disorder vector does not matter much, as the distance between A and B is the same as the distance between B and A because of symmetry.

Mixed Types

Normally database has many attributes of all data types. A preferable approach is toprocess all variables together performing a single cluster analysis.

If the data set contains p variables of mixed type, then the dissimilarity d(i, j) between object i and object j can be given as

d(i, j) = Σ(f=1 to p) δij(f) dij(f) / Σ(f=1 to p) δij(f)

where δij(f) = 0 if there is no measurement of variable f for object i or object j, or if xif = xjf = 0 and variable f is an asymmetric binary variable; otherwise δij(f) = 1.

For a quantitative variable f

dij(f) = | xif − xjf | / ( maxh xhf − minh xhf )

where h runs over all nonmissing objects for variable f.

For categorical or binary variables

dij(f) = 0 if xif = xjf
       = 1 otherwise

For ordinal variables, the procedure is to compute the rank rif and then

zif = (rif − 1) / (Mf − 1), where zif is treated as a quantitative variable.

Vector Objects

For text classification, vectors are normally used. The similarity function for vector objects can be defined as

s(X, Y) = Xt Y / ( ||X|| ||Y|| )

where Xt is the transpose of the vector X and ||X|| is the Euclidean norm (length) of the vector X. s(X, Y) is the cosine of the angle between the vectors X and Y.



For example, if the vectors are <1, 1, 0> and <0, 1, 1> then

s(X, Y) = (0 + 1 + 0) / ( sqrt(2) * sqrt(2) ) = 0.5

In the numerator, positions where both bits are 1 contribute 1; all other positions contribute 0.

A simple variation can also be used:

s(X, Y) = Xt Y / ( Xt X + Yt Y − Xt Y )

This is known as the Tanimoto coefficient or Tanimoto distance and is often used in information retrieval.
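Both measures reduce to a few dot products, as in the sketch below, which uses the same binary vectors as the example above.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / ((dot(x, x) ** 0.5) * (dot(y, y) ** 0.5))

def tanimoto(x, y):
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

x, y = (1, 1, 0), (0, 1, 1)
print(cosine(x, y))     # 1 / (sqrt(2) * sqrt(2)) = 0.5
print(tanimoto(x, y))   # 1 / (2 + 2 - 1) = 1/3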

3.8.2 Categories of Clustering Algorithms

Figure 3.13 shows the classification of clustering algorithms. Broadly speaking, the clustering algorithms can be classified into categories such as hierarchical methods, partitional methods, density based methods, grid based methods and model based methods.

Figure 3.13 Categories of Clustering Algorithm

Partitional methods use greedy procedures that are applied iteratively to obtain a single level of partitions. Being based on a greedy approach, these methods often produce locally optimal solutions. Formally speaking, there are n objects and the objective is to find k clusters such that each object can be assigned to a cluster. The number of clusters is obtained from the user beforehand and does not change while the algorithm runs. An object should be assigned to only one cluster, but objects can be relocated many times before the final assignment is made.

Hierarchical methods produce a nested partition of objects. Often the results are shown as dendograms. Here the methods are divided into two categories. Agglomerative methods start by considering each individual object as a cluster; clusters are then merged and the process is continued to get a single cluster. Divisive methods use another


kind of philosophy: a single cluster containing all the objects is chosen, it is partitioned, and the process is continued till the clusters are split into sufficiently small clusters.

Density based methods use the philosophy that at least some minimum number of data points should be present within an acceptable radius of each point of the cluster.

Grid based methods partition the data space, rather than the data points, based on the characteristics of the data. This method can handle continuous data, and one great advantage of this method is that the input data order plays no role in the final clustering process.

Model based methods try to cluster data that have a similar probability distribution.

3.8.3 Partitional and Hierarchical Methods

Classical clustering algorithms are straightforward partitional algorithms. They search the space of possible assignments c of points to k clusters to find the one that minimizes the score (or maximizes it, depending on the chosen score function). This is a typical combinatorial optimization problem. This approach is suitable for smaller data sets and is generally not applicable to larger data sets.

These are iterative improvement algorithms which employ a greedy approach. The general procedure is

Start with randomly chosen points; reassign points so as to improve the score function.

Recalculate the updated cluster centers and reassign points till there is no change in the score function or in the cluster memberships.

Advantages of classical algorithms

Simplicity
At least a guarantee of a local optimum (maximum or minimum) of the scoring function

K-means

The K-means algorithm is a straightforward partitional algorithm. It gets the value of K (the number of clusters) beforehand from the user. Then K random cluster centers are chosen and each point is assigned to the cluster whose center is nearest to it. The mean vectors of the points assigned to each cluster are then recomputed and used as the new centers, and this process is continued till no change of instances between clusters is noticed.

The procedure of this algorithm is summarized as follows:

1. Determine the number of clusters before the algorithm is started. This is called K.
2. Choose K instances randomly. These are the initial cluster centers.


3. Assign each remaining instance to the closest cluster based on the Euclidean distance.
4. Recompute the new means.
5. Repeat the iteration till the new means are the same as the old means; otherwise take the new means as the centers and go to step 3.

The complexity of the K-means algorithm is O(k n I), where I is the number of iterations. The complexity of computing a new cluster center is O(n).
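A minimal sketch of this procedure for two-dimensional points is given below; the sample points and the choice K = 2 are illustrative, and empty clusters and ties are handled only crudely.

import math, random

def k_means(points, k, max_iter=100):
    centers = random.sample(points, k)                     # step 2: random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step 3: assign to the closest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]   # step 4: recompute the means
        if new_centers == centers:                         # step 5: stop when the means are stable
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(k_means(points, k=2)[0])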

K – Medoid

The K-medoid algorithm is a partitional algorithm whose goal is, given k, to find k representatives (medoids) in the data set so that, when each object is assigned to the closest representative, the sum of the distances between the representatives and the objects is minimal. This algorithm is similar to the K-means algorithm, but here only data points in the space can become medoids, while in K-means any point in the space can be a mean point. Based on the cost calculations, the medoids are swapped or retained until there is no change for any of the assumed medoids.

The procedure of the K-medoid algorithm is given below:

1. Arbitrarily choose K objects as the initial medoids (representatives).
2. Repeat
   a. Assign each remaining object to the cluster with the nearest medoid.
   b. Randomly select a non-medoid object Orandom.
   c. Compute the cost S of swapping a medoid Oj with Orandom; if S < 0 then swap Oj with Orandom to form a new set of K medoids.
   until there is no change.
3. Done.

Hierarchical Clustering Algorithms

Hierarchical methods produce a nested clustering tree and are classified into two categories:

Agglomerative (merge)
Divisive (divide)

A convenient graphic display for the clusters is a tree-like structure called a dendogram.

Agglomerative

Agglomerative methods are often referred to as AGNES (AGglomerative NESting). Agglomerative methods are based on measures of distance between clusters. They merge clusters to reduce the number of clusters, each time merging the two closest clusters, and the process is repeated until a single cluster is obtained. Usually the initial state is that every cluster consists of a single point.


The procedure is given as follows:
1. Place each data instance into a separate cluster.
2. Till a single cluster is obtained:
   a. Determine the two most similar clusters.
   b. Merge the two clusters into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as the final result.

Advantages of hierarchical methods

No vector representation is needed for each object
Easy to understand and interpret
Simplicity

Divisive methods

These algorithms are referred to as DIANA (DIvisive ANAlysis). These methods start with a single cluster containing all the objects and repeatedly split it, each time choosing the split that causes the least deterioration in the model, till every cluster contains a single point. The splitting can be monothetic, where the split is done using one variable at a time, or polythetic, where all of the variables together form the basis for splitting.

Disadvantages of divisive methods

Computationally intensive
Less widely used

Hierarchical Methods use distance measures for clustering purposes. Some of thecommon algorithms and the distance measures are given the following Table 3.9.

Table 3.9 Distance Measures.


Single linkage

Consider the array of distances D = [djk]:

        O1   O2   O3   O4
  O1    0
  O2    1    0
  O3    8    2    0
  O4    6    3    4    0

The minimum is 1, so the new object O1,2 is formed.

        O1,2  O3   O4
  O1,2  0
  O3    2     0
  O4    3     4    0

Distance (O1,2, O3) = min {Distance (O1, O3), Distance (O2, O3)} = min {8, 2} = 2

Distance (O1,2, O4) = min {Distance (O1, O4), Distance (O2, O4)} = min {6, 3} = 3

        O1,2,3  O4
  O1,2,3  0
  O4      3     0


The corresponding Dendogram is shown in Figure 3.14.

Figure 3.14 Dendogram.

Distance (O1,2,3, O4) = min { D(O1,2, O4), D(O3, O4)}= min (3,4)= 3

Example using complete linkage:

Dist (O1,2, O3) = max {Dist (O1, O3), Dist (O2, O3)} = max {8, 2} = 8

Dist (O1, 2, O4) = max {Dist (O1, O4), Dist (O2, O4)}= max {6, 3}= 6

O1, 2 O3 O4

O1, 2 0

O3 8 0

O4 6 4 0

O1, 2 O3, 4

O1, 2 0

O3, 4 8 0


The final Dendogram is shown in Figure 3.15.

Figure 3.15 Dendogram for complete Linkage

Dist (O1, 2, O3, 4) = max {Dist (O1, 2, O3), Dist (O1, 2, O4)}= max {8, 6}= 8

Average:

This process is similar to the earlier cases, but the average is considered.

Dist (O1,2, O3) = 1/2 {Dist (O1, O3) + Dist (O2, O3)} = 1/2 {8 + 2} = 5

Dist (O1,2, O4) = 1/2 {Dist (O1, O4) + Dist (O2, O4)} = 1/2 {6 + 3} = 4.5

O1, 2 O3 O4

O1, 2 0

O3 5 0

O4 4.5 4 0

Dist (O1,2, O3,4) = 1/2 {Dist (O1,2, O3) + Dist (O1,2, O4)} = 1/2 {5 + 4.5} = 4.75

        O1,2   O3,4
  O1,2  0
  O3,4  4.75   0


The final dendogram is shown in Figure 3.16

Figure 3.16 Dendogram for Average Link

Advantages of hierarchical clustering:

1. The number of clusters need not be fixed in advance; a suitable clustering can be read off the dendogram.
2. Easy to understand.
3. Easy to detect outliers.
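The merging steps of agglomerative clustering can be automated. The sketch below works from a precomputed distance matrix (the same values as in the single-linkage worked example of the previous section, with d(O1, O2) = 1 assumed) and records each merge together with the distance at which it happens.

def agglomerative(dist, names):
    # dist[a][b] holds the distance between items a and b; names are the initial clusters.
    clusters = {name: frozenset([name]) for name in names}
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under the single-linkage (minimum) rule
        pairs = [(min(dist[x][y] for x in clusters[a] for y in clusters[b]), a, b)
                 for a in clusters for b in clusters if a < b]
        d, a, b = min(pairs)
        merged = clusters.pop(a) | clusters.pop(b)
        new_name = a + ',' + b
        clusters[new_name] = merged
        merges.append((new_name, d))
    return merges

dist = {'O1': {'O1': 0, 'O2': 1, 'O3': 8, 'O4': 6},
        'O2': {'O1': 1, 'O2': 0, 'O3': 2, 'O4': 3},
        'O3': {'O1': 8, 'O2': 2, 'O3': 0, 'O4': 4},
        'O4': {'O1': 6, 'O2': 3, 'O3': 4, 'O4': 0}}
print(agglomerative(dist, ['O1', 'O2', 'O3', 'O4']))   # merges at distances 1, 2 and 3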

3.8.4 Clustering in Larger Databases

The problems with the earlier algorithms are that

they operate only in limited memory, and
they assume that all the data is available at once.

But data mining involves the processing of huge data sets; hence data mining algorithms should be scalable. A scalable clustering algorithm operates in the following manner:

1. Read a subset of the database into main memory.
2. Apply clustering techniques to the data in main memory.
3. Combine the results with those from prior samples.
4. The in-memory data is divided into:
   a. those items that will always be needed even when the next sample is brought in,
   b. those that can always be discarded, with appropriate updates to the data being kept in order to answer the problem, and
   c. those that can be stored in a compressed format.
5. If the termination criteria are not met, then repeat from step 1.

Some of the scalable algorithms are BIRCH, ROCK and CHAMELEON. The following sections deal with these highly scalable algorithms.

BIRCH

BIRCH is an acronym of Balanced Iterative Reducing and Clustering using Hierarchies. It is an incremental algorithm which can adjust its memory requirements to the size of memory that is available. The algorithm uses the concept of a CF, that is, a clustering feature. A CF is a triplet summarizing information about a sub-cluster:

CF = ( N, LS, SS ), where

N = the number of points in the sub-cluster

LS = the linear sum of the N points = Σ(i=1 to N) xi

SS = the squared sum of the data points = Σ(i=1 to N) xi²


Example: for the points (1, 1), (2, 2) and (3, 3)

N = 3

LS = (6, 6)

SS = (14, 14)

CF = (3, (6, 6), (14, 14))
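A CF triplet is cheap to compute, and two CFs can be merged by component-wise addition, which is what makes the CF tree incremental. The sketch below illustrates this on the same three points; the merge helper is only an illustration of the additivity of CFs, not BIRCH's full insertion logic.

def make_cf(points):
    # CF = (N, LS, SS) for a set of d-dimensional points.
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))        # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))   # squared sum
    return n, ls, ss

def merge_cf(cf1, cf2):
    # Two CFs are merged by adding their components.
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf = make_cf([(1, 1), (2, 2), (3, 3)])
print(cf)                                # (3, (6, 6), (14, 14))
print(merge_cf(cf, make_cf([(4, 4)])))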

A CF tree is a balanced tree with two parameters:

1. B, the branching factor, which specifies the maximum number of children, and
2. T, the threshold, which specifies the maximum diameter of the sub-clusters stored at the leaf nodes.

The CF tree characteristics are:

It is built dynamically; building the CF tree involves inserting each point into the correct leaf node.
The threshold condition is checked, and if necessary the leaf is split and the tree is rebuilt.
If memory is limited, then merging takes place with the nearest cluster.

The BIRCH algorithm can be written as

Input:

D = { x1, x2, ....., xn }   // set of elements
B   // branching factor - maximum number of children
T   // threshold - maximum diameter

Output:

K   // set of clusters

For each xi in D do
  Determine the correct leaf node for the insertion of xi
  If the threshold condition is not violated
  Then
    Add xi to that cluster and update the CF triplets
  Else
    If there is room to insert xi
    Then


      Insert xi as a single cluster and update the CF triplets
    Else
      Split the leaf node and redistribute the CF features.

ROCK

ROCK is an acronym of RObust Clustering using linKs. It is a hierarchical clustering method using the concept of links.

Traditional algorithms use distance measures for clustering. Sometimes, due to noise or outliers, such a clustering process yields very poor results.

The approach of the ROCK algorithm is that, instead of adopting a local approach, it uses the concept of the neighborhoods of individual pairs of points.

If two points share similar neighborhoods, then the two points belong to the same cluster and can be merged. That is, two points Pi and Pj are neighbors if similarity(Pi, Pj) > theta, where similarity is a similarity function and theta is the user-specified threshold. The number of common neighbors of Pi and Pj is called the number of links between them. If the number of links is high, then they belong to the same neighborhood. Approximately, this is equivalent to the Jaccard coefficient applied to a transaction database.

For a transaction database, the similarity function between transactions Ti and Tj can be given as

Similarity (Ti, Tj) = | Ti ∩ Tj | / | Ti ∪ Tj |

The steps of ROCK can be summarized as

1. Using the idea of data similarity and the neighborhood concept, a sparse graph is constructed.
2. Perform agglomerative hierarchical clustering on the sparse graph.
3. Evaluate the model.

ROCK is suitable for very large databases.

CHAMELEON

Chameleon is a hierarchical clustering algorithm. This algorithm uses the dynamic modeling concept to determine the similarity between pairs of clusters.

The algorithm uses two measures: relative interconnectivity and relative closeness.

The relative interconnectivity RI(Ci, Cj) is defined as the absolute interconnectivity between Ci and Cj normalized with respect to the internal interconnectivity of the two clusters.


RI(Ci, Cj) = EdgeCut(Ci, Cj) / ( 1/2 * ( EdgeCut(Ci) + EdgeCut(Cj) ) )

EdgeCut(Ci, Cj) is the edge cut for a cluster containing both Ci and Cj.

EdgeCut(Ci) (or EdgeCut(Cj)) is the minimum sum of cut edges that partition the cluster roughly into two equal parts.

The relative closeness is the absolute closeness between Ci and Cj normalized with respect to the internal closeness:

RC(Ci, Cj) = (average weight of the edges that connect Ci and Cj) / ( (|Ci| / (|Ci| + |Cj|)) * (average weight of the internal edges of Ci) + (|Cj| / (|Ci| + |Cj|)) * (average weight of the internal edges of Cj) )

The algorithm is implemented in three steps

1. Use K-Nearest approach to construct a sparse graph. Each vertex is a data object.The edges represent the similarity of the objects. The graph represents the dynamicconcept. Neighborhood radius is determined by the density of the region. In asparse region, the neighborhood is defined more widely. Also the density of theregion is the weight of the graph.

2. The algorithm then partitions the sparse graph into a large number of small clusters.

The algorithm is based on the partitioning the graph based on min-cut algorithm. Here thecluster C is partitioned into the clusters Ci and Cj so as to minimize the weight of the edges.Edge cut indicates the absolute interconnectivity between the clusters Ci and Cj.

3. Chameleon then uses a hierarchical agglomerative algorithm repeatedly to mergethe subclusters into a larger cluster based on the similarity based on the relativeinterconnectivity and relative closeness.

3.8.5 Cluster Evaluation

Evaluation of clustering is difficult, as no test data is available as in classification; even for meaningless data some clusters are obtained. There is no fully satisfactory method available for evaluating clustering results, but the general guidelines for a good clustering are

1. Efficiency
2. Ability to handle missing data
3. Ability to handle noisy data
4. Ability to handle different attribute types
5. Ability to handle different magnitudes

The essential conditions to be satisfied by a good clustering also include properties like scale-invariance, richness (the ability to obtain all possible partitions of the data) and consistency. Consistency in clustering means that if within-cluster distances are shrunk and between-cluster distances are expanded, the clustering result should not change. But it is still difficult to find a clustering algorithm that fulfils all three of these criteria.


Summary

Classification is a supervised method that attempts to assign an instance to a class.
Regression is a sort of classification model that can predict a continuous variable.
The major issues of classification are overfitting, missing data and performance.
Types of classifiers range from statistical, distance based, decision tree, rule based and soft computing techniques.
The decision tree can be constructed using measures like entropy, information gain and Gini index.
The techniques of pruning can be used to avoid overfitting.
The Bayesian classifier is a probabilistic model which estimates the class for new data.
A Bayesian network shows the relationships among the variables of a system.
Nearest neighbor techniques are used to determine the "alikeness" of different tuples in the database.
Regression analysis is used to model the relationship between one or more independent variables and a dependent variable, whereas in multiple regression problems the output is a combination of predictor variables.
The testing techniques are the hold-out method and K-fold cross validation.
The evaluation metrics of a classifier are sensitivity, specificity, positive predictive value, negative predictive value, precision and accuracy.
The predictor models use absolute error, mean absolute error, mean squared error, relative absolute error and relative squared error.
The other criteria to evaluate classifier models are speed, robustness, scalability and goodness of fit.
Clustering is a technique of partitioning the objects.
Problems of clustering include outlier handling, dynamic data handling and interpreting the results.
Clustering uses similarity measures for grouping.
K-means and K-medoid algorithms are traditional partitional algorithms.
Dendograms are used to display hierarchical clustering results.
BIRCH, ROCK and CHAMELEON clustering algorithms are used to cluster very large data sets.
The measures used to evaluate clustering are efficiency, ability to handle missing or noisy data, and ability to handle different attribute types and magnitudes.

DID YOU KNOW

What is the difference between classification and prediction?
What is the difference between a distance measure and a metric?
What is the difference between similarity and dissimilarity? Is it possible to obtain dissimilarity if the similarity measures are available?
Traditional clustering algorithms are not suitable for very large data sets. Justify.


Short Questions

1. What is meant by a classification model?
2. Distinguish between the terms: classification, regression and estimation.
3. What are the issues of classifiers?
4. What are the types of classifiers?
5. What are the measures of measuring the degree of similarity?
6. Distinguish between the terms: entropy, information gain and Gini index.
7. What is meant by pruning?
8. State the Bayesian rule.
9. What are the advantages and disadvantages of Bayesian classification?
10. What is meant by a Bayesian network?
11. What are the advantages of a Bayesian network?
12. List out the techniques of classification methods.
13. What are the important parameters of a neural network?
14. What are the advantages and disadvantages of neural network based classification?
15. Distinguish between the terms: regression, non-linear regression and multiple regression.
16. Enumerate the evaluation methods of classifiers and predictors.
17. What is meant by clustering?
18. What is meant by clustering?
19. What are the advantages and disadvantages of clustering?
20. Enumerate the distance measures.
21. Enumerate the different clustering methods.
22. What are the advantages and disadvantages of the K-means and K-medoid algorithms?
23. What are the problems associated with clustering of large data?
24. What is the methodology of evaluating a cluster?
25. What is meant by edge cut and relative closeness?

Long Questions

1. Explain in detail the method of constructing a decision tree.
2. Explain in detail the ID3 algorithm.
3. Explain in detail the Bayesian classifier.
4. Explain in detail the Bayesian network.
5. Explain in detail the regression models.
6. Explain in detail the soft computing methods for classification.
7. List out the different similarity measures.
8. Explain in detail the methodology of the K-means and K-medoid algorithms.
9. Explain in detail the problems associated with clustering of very large data.
10. Compare and contrast the algorithms BIRCH, ROCK and CHAMELEON for clustering very large data.


UNIT IV

DATA WAREHOUSING

INTRODUCTION

Data mining will continue to take place in environments with or without a data warehouse; a data warehouse is not a prerequisite for data mining. But there is a need for an organized, efficient data storage and retrieval structure. Therefore the data warehouse is an important technology and is a direct result of the information age. This chapter presents the concepts of the data warehouse and its relevance for data mining.

Learning Objectives

To differentiate the data warehouse from the traditional database, the data mart and the operational data store
To study the architecture of a typical data warehouse
To study the fundamentals of multidimensional models
To study the OLAP concepts and OLAP systems
To study the issues, problems and trends of the data warehouse and its relevance for data mining

4.1 NEED FOR DATA WAREHOUSE

Organizations store a large amount of data for making decisions in their day-to-day activity. For this reason, organizations run many different applications, such as human resources, inventory, sales, marketing and so on. These systems are called Online Transaction Processing (OLTP) systems. These applications are mostly relational databases, but they include many legacy applications written in many programming languages and in different environments.

Over the years, organizations have developed thousands of such applications. This proliferation can lead to a variety of problems. For example, the generation of reports for managers becomes difficult as organizations spread across different geographical locations. Hence there is a need for a single version of enterprise information, containing data of a high level of accuracy and having a user-friendly interface, that can answer the queries needed for making effective decisions.


In order to meet the requirements of the decision makers, it makes sense to create a separate database that stores the information that is of interest to the decision makers. The new system can help the decision makers in analyzing patterns and trends.

Two kinds of solutions have been proposed to tackle this problem. One solution is dimensional modeling, and the other is either an operational data store or a data warehouse.

4.2 OPERATIONAL DATA STORE (ODS)

Operational Data Store or simply ODS is one solution that is designed to provide aconsolidated view of the business organization current operational information.

Inmon and Imhoff defined ODS as a subject-oriented, Integrated, volatile,current valued data store, containing only corporate data.

In contrast, a data warehouse does not contain any operational data. So it is better to review the ODS first and then use that concept before developing a data warehouse system. In short, an ODS can provide assistance for a grand data warehouse project.

ODS is a subject-oriented data. OLTP contains application oriented data. For example,a payroll contains all data that is relevant for payroll application. On the other hand, ODSdata is generic. It is designed to contain data related to the major data subjects of thebusiness organizations. ODS contains integrated data as it derives data from multiple datasources. ODS provides a consolidated view of the data of the organization.

ODS is current valued. This means that the ODS contains up-to-date information. ODS data is volatile, as the ODS constantly refreshes itself with new information periodically. ODS data is also detailed data, because it is detailed enough to enable the decision makers to take effective decisions.

Thus the ODS may be viewed as the organization's short-term memory and can also be considered a miniature data warehouse.

The advantages of the ODS system are

1. It provides a unified view of the operational data. It can provide a reliable store of information.
2. It can assist in a better understanding of the business and the customers of the organization.
3. The ODS can help to generate better reports without having to resort to the OLTP systems and other legacy systems.
4. Finally, the ODS can help to populate the data warehouse. This reduces the time for the development of the final data warehouse.


Design and Implementation of ODS

The typical architecture of an ODS is given in Figure 4.1.

Figure 4.1 ODS System

The extraction of information from different data sources needs to be efficient. Also, the data of the ODS is constantly refreshed, and this process is carried out regularly and frequently. The quality of the data is checked very often for various constraints.

Populating the ODS is the acquisition process of extracting, transforming and loading data from various data sources. This is called ETL.

ETL tasks look simple, but in reality they require great skill in terms of management, business analysis and technology. The ETL process involves many off-the-shelf tools and deals with different issues like

1. Handling different data sources, which is a complex issue as data sources include flat files, RDBMSs and various legacy systems.
2. Compatibility between source and target systems
3. Technological constraints
4. Refreshing constraints
5. Quality and integrity of data
6. Issues like backup, recovery etc.

Transformation is an important area where problems like multi-source data are tackled. The problems include instance identity problems, data errors and data integrity problems. A sound theoretical basis of data cleaning is given below.


1. Parsing: All data components are parsed for errors.
2. Correcting: The errors are rectified. For example, a missing value may be corrected using the relevant information of the organization.
3. Standardizing: The organization can evolve a rule to standardize the data. For example, a format dd/mm/yyyy or mm/dd/yyyy can be fixed for all data of DATE type. Then all the dates are forced to comply with this business rule.
4. Matching: Since most of the data are related, the data must be matched to check their integrity.
5. Consolidation: The corrected data should be consolidated to build a single version of the enterprise data.

Once the data is cleaned, it is loaded into the ODS. Suitable ETL tools can be used to automate the ETL process.
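For illustration only, the following is a minimal sketch of the parsing, correcting, standardizing and consolidation steps, written in Python with the pandas library. The source tables and column names (cust_id, dob, city) are invented for the example and do not come from any particular ODS.

```python
# Illustrative only: a tiny cleaning pass over customer records from two
# hypothetical sources (column names invented for the example), using pandas.
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2],
                         "dob": ["31/12/1980", "20/05/1979"],
                         "city": ["Chennai", " chennai "]})
source_b = pd.DataFrame({"cust_id": [2, 3],
                         "dob": ["20/05/1979", None],
                         "city": ["Chennai", "Madurai"]})

def clean(df):
    out = df.copy()
    # Parsing and standardizing: force every date of birth to the dd/mm/yyyy rule.
    out["dob"] = pd.to_datetime(out["dob"], format="%d/%m/%Y", errors="coerce")
    # Correcting: trim stray spaces and normalize the case of city names.
    out["city"] = out["city"].str.strip().str.title()
    return out

# Matching and consolidation: merge the cleaned sources on the customer key,
# keeping the first non-missing value, to build a single version of the data.
combined = pd.concat([clean(source_a), clean(source_b)])
ods_view = combined.groupby("cust_id", as_index=False).first()
print(ods_view)
```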

4.3 INTRODUCTION TO DATA WAREHOUSE

A Data Warehouse is defined as "a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making process" (W.H. Inmon). Thus the Data Warehouse is different from an operational DB in four aspects.

Subject oriented: The data that is stored in the Data Warehouse is related to subjects like sales, product, supplier and customer. The aim is to store data related to a particular subject so that decisions can be made.

Integrated: The Data Warehouse is an integrated data source. It derives data from various data sources ranging from flat files to databases. The data is cleaned and integrated before it is stored in the Data Warehouse.

Non-volatile: The operations on the Data Warehouse are limited to adding and updating data. The warehouse is a permanent store of data, and data is normally not deleted. Over a period of time, these data become historical data.

Time variant: To make correct decisions, there is a need for an explicit or implicit time constraint on the Data Warehouse structure to facilitate trend analysis.

Another definition of the Data Warehouse (Sean Kelly): the data is Separate, Available, Integrated, Time-stamped, Subject-oriented, Non-volatile and Accessible.


Subject oriented data

Normally, all business organizations run a variety of applications like payroll, sales and inventory. They organize data sets around individual applications to support their operational systems. In short, the data is required to suit the functionalities of the application software.

In contrast, the Data Warehouse organizes the data in the form of subjects and not by applications. The subjects vary from organization to organization.

For example, an organization may choose to organize data using subjects like sales, inventory, transport etc. An insurance company may have subjects like claims. In short, the Data Warehouse data is meant for assisting the business organization to take decisions. Hence there is no application flavour.

Integrated

The data in the Data Warehouse comes from different data sources. The data stores may have different databases, files and data segments. Data Warehouse data is constructed by extracting the data from various disparate sources; its inconsistencies are removed, and then it is standardized, transformed, consolidated and finally integrated.

The data is then preserved in a clean form so that it is devoid of any errors.

Time variant data

As time passes, the data that is used by the organizations becomes stale and obsolete. Hence the organizations always maintain current values. On the other hand, the Data Warehouse data is meant for analysis and decision making. Hence the past data is necessary for analyzing and performing trend analysis.

Hence the Data Warehouse has to contain historical data along with the current data. Data is stored as snapshots. Hence the data in the Data Warehouse is time-variant in nature, which allows analysis of the past, relates information to the present and effectively enables prediction of the future.

Non-Volatile data

While the data in the operational systems is intended to run the day-to-day business, Data Warehouse data is meant for assisting the organizations in decision making. Hence the Data Warehouse data are not updated in real time. While we change, add to and delete the operational DB at will, the Data Warehouse is not very commonly updated. That is, the data is not as volatile as the data in the operational database.

The data warehouse sounds similar to the ODS, but it differs from the ODS in some aspects. These are tabulated in the following Table 4.1.


Table 4.1 ODS Vs Data Warehouse

A Data Warehouse also sounds similar to a database. So how does a Data Warehouse differ from a database?

Even though they share main concepts like tables, schemas and so on, the Data warehouse is different from the database in a number of aspects, and these differences are so fundamental that it is often better to treat the data warehouse differently from the database.

1. The biggest difference between the two is that most databases place more emphasis on applications. More often the applications cater to a single domain like payroll processing or finance, and do not contain data of multiple domains. In contrast, data warehouses always deal with multiple domains. This allows the data warehouse to show the company as a single whole entity rather than in individual pieces.

2. Data warehouses are designed to support analysis. The two types of data that a company will normally possess are operational data and decision support data. The purpose, format, and structure of these two data types are different. In most cases, the operational data will be placed in a relational database while the data warehouse will have all the decision support data.

3. In the relational database, tables are frequently used. Normally the normalization rules are used to normalize data to remove redundancy. While this mechanism is highly effective in an operational database, it is not conducive to decision making, as the changes are not maintained and monitored. For example, a student database may have the current data, but often it will not show the evolution of that database. In short, historical information is absent in the database. In this situation, decision support is often useful, and the data warehouse takes the responsibility of maintaining the historical information.

4. Data warehouse data often differs from relational database data in many aspects like time span. Generally, database data are involved with a time span that is atomic or current, and generally deal with a short time frame. However, data warehouse data deal with longer time frames. Another difference is the granularity of the data. The decision support data is detailed as well as summarized data that has many different levels of aggregation.

In short, data warehouses are more elaborate than a mere database.


Advantages and Disadvantages of a Data Warehouse

There are a number of advantages the business organizations can derive by using a data warehouse. Some of the advantages are listed below.

1. The organization can analyze the data warehouse to find historical patterns and trends that can allow them to make important business decisions.

2. The data warehouse enables the users to access a large amount of information. This information can be used to solve a number of queries of the users as well as the decision makers. The data warehouse can access data from different locations in a combined form. Hence the decisions made by the decision makers will assist in increasing profits and reducing costs.

3. When data is taken from multiple sources and placed in a centralized location, an integrated view enables the decision makers to make a better decision than they would if they looked at the data separately.

4. Data mining is connected to data warehouses. Hence data warehousing and data mining can complement each other.

5. Data warehouses create a structure which will allow changes to the data. The changed data is then transferred back to operational systems.

Disadvantages

1. Data must be cleaned, loaded and extracted to get quality results. This process can take a long period of time and many compatibility issues are involved.

2. Users who will be working with the data warehouse require training.

3. The data warehouse is accessed by higher level management. Hence confidentiality and privacy are important issues. Accessing the warehouse through the Internet involves a large number of security problems.

4. It is difficult to maintain data warehouses. Any organization that is considering using a data warehouse must weigh the benefits of the warehouse against the cost of the project to decide upon the worthiness of the data warehouse project.

Strategic Use of a Data Warehouse

It is not enough for a company to simply acquire a data warehouse if the business organization is not able to utilize it properly. It is crucial to use the data warehouse effectively in order to make important decisions.

Three categories of decision support can be defined

1. Reporting data: Reporting data is considered the lowest level of decision support. But it is necessary for business organizations to generate informative reports to carry out successful business.

2. Analyzing data: The data warehouse has plenty of tools for multidimensional analysis. Hence business organizations should use these tools properly to analyze


data that can assist the business organizations. It is important to understand the past, present, and future. Hence a company can analyze the information to learn about the mistakes they made in the past, and they can find ways to avoid those mistakes being repeated in the future. Companies also want to place an emphasis on learning. This process is important for a company to maneuver quickly to find a competitive edge among its competitors.

3. Knowledge discovery: Knowledge mining takes place through data mining. Hence a company will study patterns, connections, and changes over a given period of time in order to make important decisions. A data warehouse is also a tool that can allow companies to measure their successes and failures in terms of the decisions made.

Data Warehouse Issues

There are certain issues involved in developing data warehouses that companies need to look at. A failure to prepare for these issues may result in a poor data warehouse.

1. The first issue is the quality of company data. One issue that confronts the organization is the time it needs to spend on loading and cleaning data. Some experts believe that a typical data warehouse project spends 80% of its time on this process.

2. It is difficult to estimate the time for a data warehouse project. More often than not, it takes longer than the initial estimate.

3. Another issue that companies will have to face is security. Often the problem is to decide what kind of data or information can be placed in the warehouse. Security and other issues often suddenly crop up, thereby delaying the project.

4. Balancing the existing OLTP systems and applications versus the data warehouse may cause serious problems. Often the business organizations need to take decisions on whether the problem can be fixed via the transaction processing system or a data warehouse.

5. There are many problems associated with data validation.

6. Estimating budgets for developing, maintaining and managing the data warehouse is a big issue.

Data Warehouse and Data mart

A data mart is a database. It contains a subset of the Data Warehouse data. The Data Warehouse data is divided into many data marts so that data can be accessed faster.

There is considerable confusion between the terms data warehouse and data mart as well. It is essential to understand the differences that exist between these two. Data warehouses and data marts are not the same thing. There are some differences between the data warehouse and the data mart.


1. A data warehouse has a structure which is separate from a data mart. A data mart is specific; say, it can store groups of subjects related to a marketing department. In contrast, a data warehouse is designed around the entire organization as a whole. The data warehouse is owned by the organization and not by specific departments.

2. Data contained in data warehouses are highly granular, while the information in data marts is not very granular.

3. The information stored in a data warehouse is very large compared to a data mart.

4. Much of the information that is held in data warehouses is historical and not biased towards any single department. It takes an overall view of the organization rather than a specific view.

Role of Metadata in the Data warehouse

Metadata is data about the data in the Data Warehouse and is stored in a repository. The metadata is a sort of summarized data of the warehouse. It covers current and old detailed data and lightly or highly summarized data. These serve as an index and help to access the data present in the Data Warehouse. The metadata has

i. Structures to maintain the schema, data definitions and hierarchies
ii. History of the migrated data
iii. Summarization algorithms
iv. Mapping constructs to help map the data in the operational systems to the Data Warehouse
v. Data related to the profiles and schedules to improve the performance of the system

4.4 DATA WAREHOUSE ARCHITECTURE

The arrangement of components is called the architecture. The components are arranged in a form that extracts maximum benefits. The basic components of the Data Warehouse are as follows:

1. Source data component
2. Data staging component
3. Data storage component
4. Information delivery component

The basic components are generic in nature, but the arrangement may vary as per the organization's needs. The variation is due to the fact that some of the components are stronger than others in a given architecture.

The basic components are described below.

1. Source data component

The source data coming into the Data Warehouse falls into four broad categories.


This component concerns the data that serves as input for the Data Warehouse. It consists of four categories of data.

Operational Systems data

Operational systems provide the major chunk of data for the Data Warehouse. Based on the requirements of the Data Warehouse, the different segments of the data are chosen. But the problem associated with the data is that they lack conformance. Hence the great challenge of the warehouse designers is to standardize, transform and integrate the data to provide useful data for the warehouse.

Internal data

These data belong to the individual users and may be available in the form of spreadsheets or databases. These data may be useful for the warehouse, and a suitable strategy is required to utilize them.

Archived data

Operational systems periodically back up the data. The frequency of the backup operation may vary. The Data Warehouse uses historical data to take decisions. Hence the warehouse uses archived data as well.

External data

Business organizations use external data for their day-to-day operation. Information like market trends in stock exchanges is vital and is typical external data. Usually external data vary in data formats. These data should be organized so that the warehouse can make use of them.

2. Data staging component

The data staging component is a workbench for performing three major functions:

1. Extraction
2. Transformation
3. Loading

The extraction function plays a very important role, as the Data Warehouse pulls data from various operational systems. The data differ in formats and sources. These range from flat files to different relational databases. This step involves various techniques and tools to prepare data suitable for the Data Warehouse requirement.

Data transformation performs the crucial role of data conversion. First the external data is cleaned. The cleaning may range from correction of spellings and missing value analysis to advanced standardization of data. Finally, after the application of different techniques, this step provides an integrated data set that is devoid of any data collection errors.


Data loading concerns both the initial loading of data into the Data Warehouse and performing incremental revisions on a regular basis.

Collectively this is called the ETL operation.

3. Data storage component

After the data is loaded, it is stored in a suitable data structure. Generally this is kept separate from the storage of data related to operational systems. Normally operational data vary very often. Hence the analysts should have the stable data of the warehouse. That is the reason why Data Warehouses are "read-only" data repositories. No addition or deletion is done in real time, and updating is restricted to a few people.

4. Information delivery component

This component is intended to serve users ranging from the novice user to the advanced analyst who wishes to perform complex analysis. Hence this component provides different methods of information delivery that include ad-hoc reports and online statistical analysis, as well as periodical delivery over the Internet.

The basic components are generic in nature, but the arrangement may vary as per the organization's needs. The typical three-tier Data Warehouse architecture is shown in Figure 4.2.

Figure 4.2 Three Tier Architecture


The bottom layer consists of different data sources. It is a relational DB system. The data is extracted from the operational database through an interface called a gateway. ODBC and JDBC are examples of such gateways.

The next layer is the Data warehouse or Data Mart. A data mart contains a subset of the corporate data that is of value to a specific group of users. Normally a data mart is implemented using low-cost machines. Dependent data marts derive their data directly from the entire warehouse, while independent data marts derive their data directly from the operational databases. Metadata are data about data, which include the structure of the Data warehouse, operational metadata, algorithms for summarization, mappings, performance-related data and business data which includes business terms and definitions.

The next layer is the OLAP server. It can follow either the ROLAP model or the MOLAP model.

The front end tools and utilities perform functions like data extraction, data cleaning, transformation, data loading and refreshing.
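For illustration only, the following sketch shows the idea of extracting data from an operational source through a gateway, using Python's built-in sqlite3 module as a stand-in for an ODBC/JDBC connection; the sales table and its columns are invented for the example.

```python
# Illustrative only: pull data from an "operational" database through a
# DB-API gateway (sqlite3 stands in for an ODBC/JDBC connection) into a
# staging list that would feed the warehouse load step.
import sqlite3

conn = sqlite3.connect(":memory:")          # stand-in for the OLTP system
conn.execute("CREATE TABLE sales (sale_id INTEGER, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "Soap", 20.0), (2, "Oil", 55.5)])

# Extraction: the gateway query that the staging component would issue.
staged_rows = conn.execute(
    "SELECT sale_id, product, amount FROM sales").fetchall()

for row in staged_rows:
    print(row)      # rows now ready for transformation and loading
conn.close()
```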

4.5 IMPLEMENTATION OF DATA WAREHOUSE

OLAM (On-Line Analytical Mining) integrates data mining with OLAP and performs knowledge extraction on multidimensional data. OLAM is important because of the high quality data in the warehouse, OLAP-based exploratory analysis and on-line selection of data mining algorithms. The integration of data mining and OLAP is shown in Figure 4.3.

Figure 4.3 Integrating OLAP and DM.

The OLAM engine performs data mining tasks like concept description, association, classification, prediction, clustering and so on.


4.6 MAPPING THE DATA WAREHOUSE TO MULTIPROCESSOR ARCHITECTURE

Data warehouse environments are generally large and complex. So they require good server management policies to monitor processes and statistics. Having multiple CPUs allows the Data warehouse to perform more than one job at a time. Hence parallelization plays an important role in the effective implementation of the Data warehouse server.

Server Hardware

This is the most crucial part. There are some architectural options. They are

Symmetric Multi-Processing (SMP)

Massively Parallel Processing (MPP)

Figure 4.4 shows an SMP machine, which is a set of tightly coupled CPUs that share memory and disk.

Example:

Figure 4.4 Tightly Coupled SMP Machine.

A cluster is a set of loosely coupled SMP machines connected by an "interconnect". Each machine is called a node. Every node has its own CPU and memory, but all share access to a disk. Together these mimic a larger machine. Software manages the shared disk in a distributed fashion. This kind of software is called a distributed lock manager.

Massively Parallel Processing (MPP)

An MPP machine is made up of many nodes. All the nodes are loosely coupled and linked together by a high-speed interconnect. This is shown in Figure 4.5.


Figure 4.5 MPP Machine

MPP machines use a distributed lock manager to maintain integrity.

4.7 DESIGN OF DATA WAREHOUSE

The first step in data warehouse design is data modeling. A data model documents the structure of the data independent of data usage. A common tool is the entity relationship diagram (ERD). An ERD shows the structure of the data in terms of entities and relationships.

An entity is a concept that represents a class of persons, places or things. The characteristics of the entity are called attributes. An attribute that is unique is called a key. Entities are connected by various relationships. The relationships can be one-to-one, one-to-many or many-to-many.

Figure 4.6 Sample ERD

(The sample ERD in Figure 4.6 shows a Faculty entity, with attributes Faculty ID, Faculty Name and Designation, related to a Subject entity, with attributes Subject ID and Subject Name.)


Once the ERD is completed, the model is analyzed. The analysis involves the application of normalization rules. The objective of normalization is to remove the redundancies that are present in the relational database. The first normal form (1NF) specifies that all attributes must have a single (atomic) value. The entities are in 2NF if all the non-key attributes are fully dependent on the primary key. 3NF requires that the non-key attributes depend only on the primary key and not on other non-key attributes.

Data warehouse design is different from database design. The design involves three steps:

1. Conceptual data model
2. Logical data model
3. Physical data model

Conceptual Data Model

The conceptual data model includes the identification of the important entities and the relationships among them. At this level, the objective is to identify the relationships among the different entities.

Logical Data Model

The steps of the logical data model include identification of all entities and the relationships among them. All attributes for each entity are identified, and then the primary key and foreign keys are identified. Normally, normalization occurs at this level.

In data warehousing, it is common to combine the conceptual data model and the logical data model into a single step.

The steps for the logical data model are indicated below:

1. Identify all entities.
2. Identify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve all many-to-many entity relationships.
6. Normalize if required.

Physical Data Model

Features of the physical data model include:

Specification of all tables and columns.
Specification of foreign keys.
Denormalization may be performed if necessary.

At this level, the specification of the logical data model is realized in the database.


The steps for physical data model design involve the conversion of entities into tables, conversion of relationships into foreign keys, conversion of attributes into columns, and changes to the physical data model based on the physical constraints.

Multidimensional Model

The dimensional data model is used in data warehousing systems. This is different from the traditional 3NF design of a traditional OLTP system. The multidimensional model is a way to view and integrate data in a database. The data can be stored using different data structures for effective retrieval. It may require storing data using multiple dimensions.

A dimension is a category of information. For example, sales data involves dimensions like region, time and product type. The user query may be to summarize data region-wise, or for a period, or for a particular product type. Hence it makes sense to store data in a form such that the retrieval time is faster. This type of modeling is called dimensional modeling.

An attribute is a unique level within a dimension. For example, year is an attribute in the Time dimension. Often the attributes have relationships among them. This kind of relationship is called a hierarchy. A hierarchy represents the relationship between the different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

The specific object of interest is stored in a table called the fact table. The fact table contains the measures of interest. For example, the sales fact table contains the attributes related to sales. Often this is numeric data. A fact consists of measures and context data; the measures are what the queries aggregate, and the dimensions facilitate the retrieval of the data.

The dimensions can be stored as a table or a cube. This table is called a lookup table. This table contains all the detailed information about the attributes. For example, the lookup table for the year in the Time dimension contains a list of all of the years available in the data warehouse.

The data is seen as a cube, and each dimension is seen as an axis of the cube, as shown in Figure 4.7.


Figure 4.7 Data Cube


The dimension levels support a partial order or a total order. The symbol < can be used to represent the order relationship, for example Product < Product type < Product category < Company, Day < Month < Year, and Town < District < State < Country.

Operations like aggregation can be used to summarize data, for example the sum of sales of a particular product type in a particular region. But the measures are not always additive across all dimensions.

Multidimensional schema

A schema is used to represent multidimensional data. Some of the popular schemas include:

Star schema
Snowflake schema
Constellation schema

In the star schema design, a single fact table is in the middle and is connected to the other dimensional data radially like a star. A star schema can be simple or complex. A simple star consists of one fact table, while a complex star can have more than one fact table as part of the schema.



The snowflake schema is an extension of the star schema. Here each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance. But the main disadvantage of the snowflake schema is the additional maintenance effort needed due to the larger number of lookup dimension tables.

The star schema is a graphical schema which shows data as a collection of facts and dimensions and is shown in Figure 4.8.

Figure 4.8 Star Schema.

The centre of the star contains a fact table. The simplest star schema has one fact table with multiple dimension tables.

The fact table contains data that is required for queries and can be very large. Each tuple of the fact table contains a link to the dimension tables. The dimension table contains much descriptive information about the dimension. The entire star schema can be realized using a relational system.
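For illustration only, the following is a minimal sketch of a star schema query, with pandas DataFrames standing in for the fact and dimension tables; all table and column names are invented and this is not a prescribed design.

```python
# Illustrative only: a tiny star schema held in pandas DataFrames.  The fact
# table keeps numeric measures plus foreign keys; the dimension (lookup)
# tables keep the descriptive attributes.  Names are invented.
import pandas as pd

product_dim = pd.DataFrame({"product_id": [1, 2], "product_type": ["Soap", "Oil"]})
region_dim  = pd.DataFrame({"region_id": [10, 20], "region": ["North", "South"]})
sales_fact  = pd.DataFrame({"product_id": [1, 1, 2],
                            "region_id":  [10, 20, 10],
                            "amount":     [100.0, 150.0, 80.0]})

# A typical star-schema query: join the fact table to its dimensions and
# aggregate a measure by descriptive attributes.
report = (sales_fact
          .merge(product_dim, on="product_id")
          .merge(region_dim, on="region_id")
          .groupby(["region", "product_type"], as_index=False)["amount"].sum())
print(report)
```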

For multidimensional tables, it is better to develop bitmap indexes so that the index reduces the overhead of scanning very large databases. Here the first bit of the index represents the first tuple, the second bit the second tuple, and so on. Hence the bits of the index are used to represent the tuples. To find a specific tuple, each tuple should be associated with a particular bit position. This facilitates operations like aggregation and joins.
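For illustration only, the following is a minimal sketch of such a bit-per-tuple (bitmap) index in plain Python; the rows and column names are invented.

```python
# Illustrative only: a minimal bitmap index over one column of a fact table.
# Bit position i corresponds to tuple i, so ANDing two bitmaps answers a
# conjunctive query without scanning every row.
rows = [{"region": "North", "product": "Soap"},
        {"region": "South", "product": "Soap"},
        {"region": "North", "product": "Oil"}]

def build_bitmap(rows, column):
    bitmaps = {}
    for position, row in enumerate(rows):
        bitmaps.setdefault(row[column], 0)
        bitmaps[row[column]] |= 1 << position   # set the bit for this tuple
    return bitmaps

region_idx = build_bitmap(rows, "region")
product_idx = build_bitmap(rows, "product")

# Tuples with region = North AND product = Soap: intersect the two bitmaps.
match = region_idx["North"] & product_idx["Soap"]
print([i for i in range(len(rows)) if match & (1 << i)])   # -> [0]
```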

One of the common problems of the data warehouse is the "Slowly Changing Dimension" problem. This problem is due to attributes that vary over time.

For example, consider a customer record with two attributes: customer name and location. For example, (John, Delhi) is a valid customer record. But if John moves to another location, the problem is how to record this new fact.


If the new record replaces the original record, then the old record would disappear. In that case, no trace of the old record exists. On the other hand, if a new record is added into the customer dimension table, then there is duplication, as the customer is treated as two people. One more way of tackling this issue is to modify the original record to reflect the change.

These scenarios must be evaluated, and the star schema should be designed to reflect the changes. This is required because the data warehouse is supposed to carry the historical information to enable the managers to take effective decisions.
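For illustration only, the following sketch shows one common way of keeping both the old and the new record, by closing the old record and adding a new one with validity dates (often called a type 2 approach). The field names and values are invented, and the text above does not prescribe this particular scheme.

```python
# Illustrative only: keeping history for a slowly changing dimension by
# closing the old record and appending a new one with validity dates.
from datetime import date

customer_dim = [
    {"cust_key": 1, "name": "John", "location": "Delhi",
     "valid_from": date(2005, 1, 1), "valid_to": None},   # current record
]

def record_move(dim, name, new_location, move_date):
    # Close the currently valid record for this customer ...
    for row in dim:
        if row["name"] == name and row["valid_to"] is None:
            row["valid_to"] = move_date
    # ... and append a new record so both the old and the new facts survive.
    dim.append({"cust_key": len(dim) + 1, "name": name,
                "location": new_location,
                "valid_from": move_date, "valid_to": None})

record_move(customer_dim, "John", "Mumbai", date(2008, 6, 1))
for row in customer_dim:
    print(row)
```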

4.8 OLAP

OLAP stands for On-Line Analytical Processing. It is designed to provide more complex results than traditional SQL. OLAP performs analysis of data and presents the results to the user as a report. This aspect differentiates OLAP from traditional SQL.

OLAP

There has been considerable confusion between the terms OLAP, data warehouse and data mining. OLAP is different from the other two as it is just a technology concerned with the fast analysis of information. Basically it is a front-end tool to process information that is present in a warehouse. Sometimes the term business intelligence is used to refer to both OLAP and data warehousing.

OLAP provides a conceptual view of the data warehouse's multidimensional data. Data cubes are generalizations of spreadsheets that essentially provide a multidimensional view.

The original definition of an OLAP system given by E.F. Codd is

"OLAP is the dynamic enterprise analysis required to create, manipulate, animate and synthesize information from exegetical, contemplative and formulaic data analysis models."

Exegetical means the analysis is from the manager's point of view; contemplative means it is from the view of the person who conceived it and thought about it; and formulaic means it is according to some formula. Alternatively, one can view OLAP as advanced analysis on shared multidimensional information.

Characteristics of OLAP system

The difference between OLAP and OLTP is obvious. While the users of OLTP systems are mainly middle and low level management, the users of OLAP systems are decision makers.


OLAP systems are designed to be subject oriented, while OLTP systems are application oriented.

OLTP data are mostly read and changed regularly. But OLAP data are not updated in real time.

OLTP mostly deals with current information. But OLAP systems are designed to support decision makers. Hence OLAP systems require historical data over a large period of time.

OLTP systems support the day-to-day activity of the business organization. Mostly they are performance and availability driven. But OLAP systems are management critical and are useful to the management.

Queries of OLTP systems are relatively simple. But OLAP queries are complex and often deal with many records. The records are both current and historical data. Table 4.2 tabulates some of the differences.

Table 4.2 OLTP Vs OLAP

OLTP                                        OLAP
Mostly middle and low level management      Higher level management for strategic decision making
Daily operations                            Decision support operations
Simple queries                              Complex queries
Application oriented                        Subject oriented
Current data                                Historical, summarized, multidimensional data
Read/write/update at any time               No real-time updates

What are the characteristics of OLAP systems?

OLAP characteristics are often collectively called the FASMI characteristics, based on the first letter of each characteristic. They are described as follows.

1. Fast: OLAP systems should be able to furnish information quickly. For that, the queries should be executed in a fast manner. But achieving this speed is a difficult task. It requires suitable data structures and sufficient hardware facilities to achieve that kind of performance. Sometimes the precomputation of aggregates is done to reduce the execution time.



2. Analytic: OLAP systems should provide rich functionality. The system should be user friendly, and the OLAP system should cope with multiple applications and users.

3. Shared: The OLAP system is likely to be accessed by the higher level decision makers. Hence the system should provide sufficient safeguards for confidentiality and security.

4. Multidimensional: OLAP should provide a multidimensional view of the data. This is often referred to as a cube. The multidimensional structure should allow hierarchies that show the relationships among the members of a dimension.

5. Information: The OLAP system should be able to handle a large amount of input data and information. It should integrate effortlessly with the warehouse, as OLAP systems normally obtain data from the warehouse.

Codd's OLAP characteristics

E.F. Codd has identified some of the important characteristics of OLAP systems. Some of the important characteristics are

1. Multidimensional conceptual view
2. Accessibility
3. Batch extraction Vs Interpretative
4. Multi-user support
5. Storing OLAP results
6. Extraction of missing values
7. Treatment of missing values
8. Uniform reporting performance
9. Generic dimensionality
10. Unlimited dimensions and aggregation levels

Data Cube implementations

The number of data cubes ranges from thousands to lakhs depending on the business organization. Normally decision makers want results quickly. Hence there should be some sort of strategy to handle data cube computation. Some of the strategies are mentioned below.

Precompute and store all

Often there is a need to reduce response time drastically. Hence it is better to precompute all the data cubes and store them. But this solution is practically not implementable, as the storage of so many cubes is a difficult task. Also, creating indexes for the data cubes poses a great problem.

Precompute and store none

Here the data cubes are not precomputed at all. So there is no space requirement, and the data cubes are computed on the fly. However, the response time is very poor.


Precompute and store some

In this strategy, frequently accessed cubes are precomputed and stored. Sometimes a data cube aggregate may be derived from other data cubes. This often leads to better performance.

The types of operations OLAP provides include:

Simple query

This includes obtaining a value from a single cell of the cube.

Slice

This operation looks at a subcube by selecting on one dimension.

Dice

This operation looks at a subcube by selecting on two or more dimensions. It includes slicing on one dimension and then rotating the cube to select on a second dimension.

Roll up

This operation allows the user to move up in the aggregation hierarchy, like looking at the overall sales of the company.

Drill down

This operation allows the user to get more detailed information by navigating down in the aggregation hierarchy, looking for detailed fact information.

Visualization

This operation allows the user to see the results in a visual form for better understanding. Slice and dice can be seen together as subdividing the cube on its dimensions. Drill up and drill down can be speeded up by pre-computing and storing the frequently accessed aggregations.
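For illustration only, the following sketch shows slice, dice, roll up and drill down on a toy sales "cube" held in a pandas DataFrame; the dimensions and figures are invented.

```python
# Illustrative only: slice, dice, roll up and drill down on a toy sales cube.
import pandas as pd

cube = pd.DataFrame({
    "year":    [2007, 2007, 2008, 2008],
    "region":  ["North", "South", "North", "South"],
    "product": ["Soap", "Soap", "Oil", "Soap"],
    "sales":   [100, 150, 80, 120],
})

# Slice: fix one dimension (year = 2007).
slice_2007 = cube[cube["year"] == 2007]

# Dice: select on two or more dimensions.
dice = cube[(cube["year"] == 2008) & (cube["region"] == "North")]

# Roll up: move up the aggregation hierarchy, e.g. total sales per year.
rollup = cube.groupby("year", as_index=False)["sales"].sum()

# Drill down: go back to a finer level, e.g. per year and region.
drilldown = cube.groupby(["year", "region"], as_index=False)["sales"].sum()

print(slice_2007, dice, rollup, drilldown, sep="\n\n")
```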

4.9 OLAP MODELS

The models that are available are

ROLAP - Relational On-Line Analytical Processing

MOLAP - Multidimensional On-Line Analytical Processing

DOLAP - Desktop On-Line Analytical Processing, which is a variation of ROLAP


In MOLAP, On-Line Analytical Processing is implemented by storing the data multidimensionally. Usually the MDDB is a vendor-proprietary system and stores data in the form of multidimensional hypercubes. In ROLAP, the OLAP engine resides in the desktop. In this model, no prefabricated cubes are created beforehand; rather, the relational data is presented as virtual multidimensional data cubes.

MOLAP model

This is the traditional way of OLAP analysis. In MOLAP, the data is stored in the form of a multidimensional cube. The storage is not in the form of a relational table. Hence MOLAP is forced to choose from the various proprietary standards that are available in the market.

The kind of processing in MOLAP is shown in Figure 4.9.

Figure 4.9 MOLAP Model

The OLAP engine resides in a special server that stores the proprietary multidimensional cubes.

Advantages:

* MOLAP cubes are built for fast data retrieval. Hence the system is designed to provide excellent performance in retrieval of information. These cubes also prove to be optimal for slicing and dicing operations.



* MOLAP systems can perform complex calculations: all calculations have been pre-generated when the cube is created. Hence, complex calculations are not only feasible but also return results quickly.

Disadvantages

MOLAP is limited in the amount of data it can handle. It is not possible to include a large amount of data in the cube itself, because normally only the summarized data will be included in the cubes.

MOLAP also requires additional investment, as cube technology is often proprietary. So business organizations should be in a position to make additional investments in human and capital resources.

ROLAP

In the ROLAP model, the OLAP engine resides in the desktop. Here no prefabricated cubes are created. Instead, the relational data itself is presented as virtual data cubes.

The system is shown in Figure 4.10.

Figure 4.10 ROLAP Model



In the ROLAP model, the data is stored in the form of a relational database. The data is then presented to the users in the form of dimensional data. The relational detail is hidden from the user by a thin metadata layer.

The user presents the query to the middle layer. The analytical server converts the user request into a set of complex queries and accesses the data from the data warehouse. Then the middle layer constructs the cube on the fly and presents the cube to the user. Here, unlike MOLAP, static structures are not created. The architecture of the ROLAP model is shown in Figure 4.11.

Figure 4.11 Architecture of ROLAP Model.

The major advantages of ROLAP are

1. Supports all basic OLAP features and functionalities
2. Mainly stores the data in the relational form
3. Supports some form of aggregation



The differences between MOLAP and ROLAP are shown in Table 4.3.

Table 4.3 ROLAP Vs MOLAP Models

1. ROLAP: Data is stored in the relational form. MOLAP: Summary data is stored in proprietary multidimensional databases.
2. ROLAP: Supports very large volumes. MOLAP: Supports moderate volumes.
3. ROLAP: Uses complex SQL queries to fetch data from the warehouse. MOLAP: Uses prefabricated data cubes built by the MOLAP engine.
4. ROLAP: A known environment because of its relational nature. MOLAP: A slightly unknown environment complicated by proprietary standards.
5. ROLAP: Lesser speed, because cubes must be generated on the fly. MOLAP: Faster access.

Which is the better choice? Often business organizations need to take decisions based on the user requirements, budgetary considerations, the required query performance and the complexity of the queries.

Summary

The ODS is a subject-oriented, integrated, volatile, current-valued data store, containing only corporate data.
The ODS provides a unified view, enabling the management to understand business models better.
ETL is the process of populating the ODS with data.
A Data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making process.
The Data warehouse helps in decision support through reporting data, analyzing data and knowledge discovery through data mining.
The types of data mart are base models and hybrid models.
The Data warehouse architecture has four components: source data component, data staging component, data storage component and information delivery component.
A cluster is a set of loosely coupled SMP machines connected by an interconnect.
Data warehouse design includes the conceptual model and logical model, normally combined into one, and the physical model.
The popular schemas include the star schema, snowflake schema and constellation schema.
OLAP characteristics are called the FASMI characteristics.
Data cube implementations are precompute and store all, precompute and store none, and precompute and store some.
OLAP models include MOLAP and ROLAP.



DID YOU KNOW?

1. What is the difference between ODS, Data mart and Data warehousing?
2. What is the difference between database, data warehouse and OLAP?
3. What is the link between Data warehouse and Business Intelligence?

Short Questions

1. What is the need for data warehouse?
2. What is the difference between data warehouse and database?
3. What is the standard definition of Data warehouse?
4. What is the difference between Data warehouse and ODS?
5. What are the advantages and disadvantages of Data warehouse?
6. How to use data warehouse strategically?
7. Enumerate some of the data warehouse issues.
8. What is the difference between Data warehouse and data mart?
9. What is meant by meta data?
10. What are the basic components of Data warehouse?
11. What is ETL?
12. What is meant by dimensional modeling?
13. What is a fact table and dimension lookup table?
14. What is meant by OLAP?
15. What are the characteristics of an OLAP system?
16. What is the difference between OLAP and OLTP system?
17. Enumerate some of the OLAP cube operations.
18. What are the different types of OLAP models?
19. What are the advantages and disadvantages of ROLAP and MOLAP?
20. What is the difference between ROLAP and MOLAP models?

Long Questions

1. Explain in detail the ODS design.
2. Explain in detail the Data warehouse architecture.
3. Explain how data mining can be integrated with the OLAP model.
4. Explain in detail the Data warehouse design.
5. Explain the "Slowly changing dimension" problem. Suggest how the data warehouse schema solves this problem.
6. Explain in detail the OLAP operations.
7. Explain in detail the OLAP models and their implementation.


UNIT V

APPLICATIONS OF DATA MINING

INTRODUCTION

Data mining is used extensively in a variety of fields. This chapter presents some of the domain-specific data mining applications. The social implications of data mining technologies are discussed. The tools that are required to implement the data mining technologies are presented. Finally, some of the latest developments, like data mining in the domains of text mining, spatial mining and web mining, are mentioned briefly.

Learning objectives

To study some of the sample data mining applications
To study the social implications of data mining applications
To explore some of the latest trends of data mining in the areas of text, spatial data and web mining
To discuss the tools that are available for data mining

5.1 SURVEY OF DATA MINING APPLICATIONS

Data mining and warehousing technologies are now used widely in different domains. Some of the domain areas are identified and some of the sample applications are mentioned below.

Business

Predicting the future is a dominant theme in business. Many applications are reported in the literature. Some of them are listed here:

Predicting the bankruptcy of a business firm
Prediction of bank loan defaulters
Prediction of interest rates for corporate funds and treasury bills
Identification of groups of insurance policy holders with average claim cost

Data visualization is also used extensively along with data mining applications whenever a huge volume of data is processed. Detecting credit card fraud is one of the major applications deployed by credit card companies that exclusively use data mining technology.


Telecommunication

Telecommunication is an attractive domain for data mining applications because telecom industries have huge piles of data. The data mining applications include:

Trend analysis and identification of patterns to diagnose chronic faults
Detection of frequently occurring alarm episodes and their prediction
Detection of bogus calls and fraudulent calls and identification of their callers
Prediction of cellular cloning fraud

Marketing

Data mining applications have traditionally enjoyed great prestige in the marketing domain. Some of the applications of data mining in this area include:

Retail sales analysis
Market basket analysis
Product performance analysis
Market segmentation analysis
Analysis of mail depth to identify customers who respond to mail campaigns
Study of travel patterns of customers

Web analysis

The web provides an enormous scope for data mining. Some of the important applications that are frequently mentioned in the data mining literature are:

Identification of access patterns
Summary reports of user sessions, distribution of web pages, and frequently used/visited pages/paths
Detection of the location of user home pages
Identification of page classes and relationships among web pages
Promotion of user websites
Finding the affinity of users after subsequent layout modifications

Medicine

The field of medicine has always been a focus area for the data mining community. Many data mining applications have been developed in medical informatics. Some of the applications in this category include:

Prediction of diseases given the disease symptoms
Prediction of the effectiveness of a treatment using the patient history

Applications in pharmaceutical companies are always of interest to data mining researchers. Here the projects are mostly discovery-oriented projects, like the discovery of new drugs.


Security

This is another domain that traditionally enjoys much attention from the data mining community. Some of the applications that are mentioned in this category are:

Face recognition/identification
Biometric projects like identification of a person from a large image or video database

Applications involving multimedia retrieval are also very popular.

Scientific Domain

Applications in this domain include:

Discovery of new galaxies
Identification of groups of houses based on house type/geographical location
Identification of earthquake epicenters
Identification of similar land use

5.2 SOCIAL IMPACTS OF DATA MINING

Data mining has plenty of applications. Many data mining applications are ever-present (ubiquitous) data mining applications which affect us in our daily life. Some of the examples are web search engines, web services like recommender systems, intelligent databases, and email agents, which have an overbearing influence on our life. Web tracking can help an organization to develop a profile of its users. Applications like CRM (Customer Relationship Management) help the organization to cater to the needs of customers in a personalized manner, and help them to organize their products and catalogues and to identify, market and organize their facilities.

One of the recent issues that has cropped up is the question of privacy of the data. When organizations collect millions of customer records, one of the major concerns is how the business organizations use them. These questions have created much debate on a code of conduct for data mining.

Some of these are looked at in the context of "fair information practice" principles. These principles govern the quality, purpose, usage, security and accountability of the private data.

The report says that the customers should have a say in how their private data should be used. The levels are

1. Do not allow any analytics or data mining
2. Allow internal use by the organization
3. Allow data mining for all uses.


These issues are just beginning. The sheer amount of data and the purpose of data mining algorithms to explore hidden knowledge will generate great concerns and legal challenges.

Some of the fair information practice principles are:

1. A clear purpose and usage should be disclosed at the data collection stage itself.

2. Openness with regard to developments, practices, and policies with respect to the private data.

3. Security safeguards to ensure that private data is secured. This should take care of loss of data, unauthorized data access, modification or disclosure.

4. Participation of people

Privacy-preserving data mining is a new area of data mining which concerns privacy protection during the data mining process. The aim is to avoid the misuse of data while getting all the benefits that data mining research can bring to humanity.

5.3 DATA MINING CHALLENGES

New data mining algorithms are expected to encounter more diverse data sources/types of data that involve additional complexities that need to be tackled. Some of the potential data mining challenges are listed below.

Massive datasets and high dimensionality

Huge databases provide a combinatorially explosive search space for model induction. This may produce patterns that are not always valid. Hence data mining algorithms should provide

1. Robustness and efficiency
2. Usage of good approximation methods
3. Scaling up of existing algorithms
4. Parallel processing in data mining

Mining methodologies and User Interaction issues

Mining different levels of knowledge is a great challenge. There are different types of knowledge, and different kinds of knowledge may be required at different stages. This requires that the database should be used from different perspectives, and the development of such data mining algorithms is a great challenge.

User Interaction problems

Data mining algorithms are usually interactive in nature, as users are expected to interact with the KDD process at different points of time. The quality of the data mining algorithms


can be rapidly improved by incorporating domain information. This helps to focus and speed up the algorithms.

This requires the development of high-level data mining query languages to allow users to describe ad-hoc data mining tasks by specifying the necessary data. This must be integrated with the existing database or data warehouse query language and must be optimized for efficient and flexible data mining.

The discovered knowledge should also be expressed in such a manner that the user can understand it. This involves the development of high-level languages, visual representations, or similar forms. This requires that the data mining system adopt knowledge representation techniques like tables, trees etc.

Data handling problems

Managing the data is always quite a challenge for data mining algorithms. Data mining algorithms are supposed to handle

1. Non-standard data
2. Incomplete data
3. Mixed data, involving numeric, symbolic, image and text data.

Rapidly changing data pose great problems for data mining algorithms. Changing data make previously discovered patterns invalid. Hence the development of algorithms with incremental capability is required.

Also, the presence of spurious data in the dataset leads to over-fitting of the models. Suitable regularization and re-sampling methodologies need to be developed to avoid overfitting of models.

Assessment of patterns is a great challenge. The algorithms can uncover thousands of patterns, many of which are useless for the user and lack novelty. Hence the development of suitable metrics that assess the interestingness of the discovered patterns is a great challenge.

Also, most data mining algorithms deal with multimedia data, which are stored in a compressed form. Handling compressed data is a great challenge for data mining algorithms.

Performance challenges

The development of data mining algorithms that are efficient and scalable is a great challenge. Algorithms of exponential complexity are of no use. Hence, from the database perspective, efficiency and scalability are key issues.

Modern data mining algorithms are expected to handle interconnected data sources of complex data objects like multimedia data, spatial data, temporal data, or hypertext


data. Hence the development of parallel and distributed algorithms to handle this huge and diverse data is a great challenge.

5.4 TEXT MINING

This section focuses the role of data mining in text mining. Major amount of informationis available in text databases, which consists of larger collection of documents from varioussources like books, research papers and so on. This data is semi structured data.

The data may contain few structured data like of authors, title etc. Also some of thedata components like abstract and contents are unstructured data.

Two major areas that are often associated with text is information retrieval and text mining.

On-Line library catalog system is an example of information retrieval system whererelevant document is retrieved based on user query.

Thus information retrieval is a field concerned with the organization and retrieval ofinformation from a large collection of text-related data. Unlike databases where problemsand issues like concurrency control, recovery, transaction management, text retrieval itselfhave problem like unstructured data, approximates search etc. Here the information canbe “pulled” or can be “pushed” on the system in that case is called filtering systems orrecommender systems.

Basic measures of text retrieval

The measures of accuracy of information retrieval are precision and recall.

Precision

Precision is the measure which indicates the percentage of retrieved documents that are in fact relevant to the query:

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

Recall

Recall is the measure which indicates the percentage of documents relevant to the query that are actually retrieved:

Recall = |{relevant} ∩ {retrieved}| / |{relevant}|


Based on precision and recall, a measure called the F-score captures the common trade-off between the two:

F-score = (recall × precision) / ((recall + precision) / 2)

which is equivalent to 2 × precision × recall / (precision + recall).
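As a small illustration (not part of the original text; the document identifiers are hypothetical), the following Python sketch computes precision, recall and the F-score for one query, given the set of relevant documents and the set of retrieved documents:

def precision(relevant, retrieved):
    # fraction of retrieved documents that are relevant
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # fraction of relevant documents that were actually retrieved
    return len(relevant & retrieved) / len(relevant)

def f_score(relevant, retrieved):
    p, r = precision(relevant, retrieved), recall(relevant, retrieved)
    # equivalent to 2*p*r / (p + r)
    return (p * r) / ((p + r) / 2)

# hypothetical example: documents identified by ids
relevant  = {1, 2, 3, 4}
retrieved = {2, 3, 4, 5, 6}
print(precision(relevant, retrieved))   # 0.6
print(recall(relevant, retrieved))      # 0.75
print(f_score(relevant, retrieved))     # about 0.667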

Information retrieval

Information retrieval means that, based on a query, relevant information can be retrieved. A good example is a web search engine: based on the query, the search engine matches keywords against a bulk of text to retrieve the information requested by the user.

Hence, the retrieval problem can be visualized as:

1. The document selection problem
2. The document ranking problem

In the document selection problem, the query is considered as specifying constraints for selecting documents. A typical system is the Boolean retrieval system, where the user can give a query like "bike or car"; the system then returns the documents that satisfy the Boolean expression.

In the document ranking problem, the documents are ranked based on a "relevance factor". Most systems present a ranked list in response to the user's keyword query. The goal is to approximate the degree of relevance of a document with a score computed from the frequency of words in the document and in the collection.

One popular method is the vector-space model. In this method both the document and the query are represented as vectors in the high-dimensional space of all possible keywords. A similarity measure is then used to compare the document vector with the query vector, and the similarity values are used to rank the documents.

The steps of the vector-space model are given below.

1. The first step is called tokenization. This is a preprocessing step whose purpose is to identify keywords. A "stop-list" is used to avoid indexing irrelevant words like "the", "a", etc.

2. Identification of groups of words based on commonality – word stemming is used so that variants of the same word are grouped together.

3. Term frequency is a measure which counts the number of occurrences of a term in a document. The term-frequency matrix associates each term with a given document; its value is zero if the document does not contain the term and non-zero otherwise. If t denotes a term and d a document, then



TF(d,t) = 0, if freq(d,t) = 0
TF(d,t) = 1 + log(1 + log(freq(d,t))), otherwise

The importance of the term 't' is obtained using a measure called Inverse Document Frequency (IDF):

IDF(t) = log( (1 + |d|) / |dt| )

Where,

d = the document collection (|d| is the number of documents)

dt = the set of documents that contain the term 't'

4. Combine TF and IDF to form the resulting weight:

TF-IDF(d,t) = TF(d,t) × IDF(t)
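The following Python sketch (illustrative only; the toy documents are invented) computes TF and IDF exactly as defined above and combines them into a TF-IDF weight:

import math
from collections import Counter

docs = {
    "d1": "data mining extracts patterns from data",
    "d2": "text mining mines large document collections",
    "d3": "spatial data has a location component",
}

def tf(freq):
    # TF(d,t) = 0 if freq(d,t) = 0, else 1 + log(1 + log(freq(d,t)))
    return 0.0 if freq == 0 else 1.0 + math.log(1.0 + math.log(freq))

def idf(term, docs):
    # IDF(t) = log((1 + |d|) / |dt|)
    d_t = sum(1 for text in docs.values() if term in text.split())
    return math.log((1 + len(docs)) / d_t)

def tf_idf(doc_id, term, docs):
    freq = Counter(docs[doc_id].split())[term]
    return tf(freq) * idf(term, docs)

print(tf_idf("d1", "data", docs))     # "data" occurs twice in d1 and in 2 of the 3 documents
print(tf_idf("d2", "mining", docs))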

Text Indexing

The popular text-indexing techniques are

1. Inverted index and
2. Signature file

An inverted index is an index structure that maintains two hash-indexed or B+-tree index tables, typically a document table and a term table. The document table associates each document identifier with the list of terms that occur in that document, and the term table associates each term with the list of documents in which it occurs, sorted based on some relevance factor.

The signature file is another method, which stores a signature record for each document. A signature is a fixed-size bit vector: a bit is set to 1 if the corresponding term occurs in the document, otherwise it is set to 0.

Query processing

Once the indexing is done, the retrieval system can answer a keyword query by looking up the documents that contain the query keywords. A counter is maintained for each document and updated for each query term: documents that match a term are fetched and their scores are increased.
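A minimal Python sketch of this idea is shown below (the documents are hypothetical; a real system would also apply stop-word removal, stemming and TF-IDF weighting). The inverted index maps each term to the documents containing it, and a per-document counter is incremented for every matching query term:

from collections import defaultdict, Counter

docs = {
    1: "data mining extracts patterns",
    2: "text mining mines documents",
    3: "spatial data mining",
}

# build the inverted index: term -> set of document ids
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def search(query):
    scores = Counter()
    for term in query.split():
        for doc_id in inverted.get(term, set()):
            scores[doc_id] += 1      # one counter per document, bumped per matching term
    return scores.most_common()      # documents ranked by score

print(search("data mining"))         # documents 1 and 3 score 2, document 2 scores 1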

A sort of relevance feedback can be used to improve the performance.

One major limitation of these methods is that they are based on exact matching. The problems associated with exact matching are the synonym problem, where the vocabulary differs although the meaning is the same, and the polysemy problem, where the same word means different things in different contexts.



Dimensionality reduction

The number of terms and documents is huge, which leads to inefficient computation. Mathematical dimensionality-reduction techniques are used to reduce the size of the vectors so that the application can be implemented efficiently. Some of the techniques used are latent semantic indexing, probabilistic latent semantic analysis, and locality preserving indexing.

The major approaches of text mining are

1) Keyword-based approach
2) Tagging approach
3) Information-extraction approach

The keyword-based approach discovers relationships at a shallow level by finding co-occurring patterns.

Tagging can be a manual process or an automatic way of categorizing documents.

The information extraction approach is more advanced and may lead to the discovery of deep knowledge, but it requires semantic analysis of the text using NLP or machine learning approaches.

Text-mining Tasks

1) Keyword-based association analysis
2) Document classification analysis
3) Document clustering analysis

Keyword Based analysis

This analysis collects sets of keywords or terms based on their frequency and extracts association or correlation relationships among them.

Association analysis first extracts the terms and preprocesses them using a stop-word list. Only the essential keywords are retained and stored in the database in the form

< ID, List of Keywords >

Then association analysis is performed on them.

Words that frequently appear together form a term or a phrase. Association analysis can mine compound associations (domain-dependent terms/phrases) or non-compound associations. Because compound associations are domain dependent, association analysis helps to tag such terms/phrases automatically and also helps in reducing meaningless results, as the rough sketch below illustrates.
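The following Python sketch is a simplified illustration of keyword-based association analysis (the <ID, keyword list> records are invented): frequently co-occurring keyword pairs are counted directly from the records and reported with a support and confidence value.

from itertools import combinations
from collections import Counter

# hypothetical <ID, list of keywords> records after stop-word removal
records = {
    10: {"gold", "mining", "africa"},
    11: {"data", "mining", "algorithms"},
    12: {"data", "mining", "warehouse"},
    13: {"data", "warehouse", "olap"},
}

pair_count = Counter()
item_count = Counter()
for keywords in records.values():
    item_count.update(keywords)
    pair_count.update(frozenset(p) for p in combinations(sorted(keywords), 2))

n = len(records)
for pair, cnt in pair_count.items():
    if cnt >= 2:                              # minimum support of 2 records
        a, b = sorted(pair)
        support = cnt / n
        confidence = cnt / item_count[a]      # confidence of the rule a => b
        print(f"{a} => {b}  support={support:.2f}  confidence={confidence:.2f}")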


Document Classification Analysis

Classification helps to organize documents into classes so that document retrieval is faster.

But classification of text is different from classification of relational data, because relational data is well structured. Text databases are not structured; the keywords associated with a document are not organized into any fixed set of attributes. Therefore traditional classification methods like decision trees are not directly effective for text mining.

Normally the classifiers that are used for text classification are

1) K-nearest neighbor classifier
2) Bayesian classifier
3) Support vector machine

The k-nearest neighbor classifier uses the similarity measure of the vector-space model for classification. All the documents are indexed, and each index entry is associated with a class label. When a test document is submitted, it is treated as a query; all the documents that are similar to the query document are returned by the classifier, and their class distribution is used to label the test document. The result can be refined by tuning the query to obtain a classifier with good accuracy. A minimal sketch of this idea follows.
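The Python sketch below is illustrative only: the training documents and labels are invented, and raw term-count vectors with cosine similarity stand in for full TF-IDF vectors.

import math
from collections import Counter

train = [
    ("win the football match today", "sports"),
    ("the team scored a late goal", "sports"),
    ("stock markets fell sharply today", "finance"),
    ("investors fear rising interest rates", "finance"),
]

def vectorize(text):
    return Counter(text.split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2)

def knn_classify(text, k=3):
    query = vectorize(text)
    # rank training documents by similarity to the query document
    neighbours = sorted(train, key=lambda dl: cosine(query, vectorize(dl[0])), reverse=True)[:k]
    # majority class among the k most similar documents
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_classify("the match ended with a goal"))   # expected: sports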

The Bayesian classifier is another technique that can be used for effective document classification.

Support vector machines can be used to perform classification because they work very well in high-dimensional spaces.

Association-based classification is also effective for text mining. It extracts a set of associated, frequently occurring text patterns:

• Keywords and terms are extracted and association analysis is applied to them.
• Concept hierarchies of the keywords/terms are obtained using WordNet or expert knowledge, and class hierarchies are formed.
• Association mining is then used to discover sets of associated terms that can be used to maximally distinguish one class of documents from the others; the association rules associated with each document class are derived.
• Such rules can be ordered based on their discriminative power and occurrence frequency.
• These rules are then used to classify new documents.

Document Clustering Systems

Document clustering is one of the most important topics in text mining. Initially, spectral clustering can be used to reduce the dimensionality of the document vectors.

The mixture-model clustering method models the text data with a mixture model.


It performs clustering using two steps

1) Estimate the model parameters based on the text data and prior knowledge.
2) Infer the clusters based on the estimated model parameters.

5.5 SPATIAL DATA MINING

Spatial data mining is the process of discovering hidden but potentially useful patterns from a large set of spatial data. Spatial data are data that have a location component; the location refers to physical space, such as an address or a longitude and latitude.

Spatial data can be stored in a spatial database. Spatial databases are constructed using special data structures or indices based on distance or topological information. Some of the characteristics of spatial data are mentioned below.

1) Rich data types
2) Spatial relationships among variables

If a point, say a house, is affected by an earthquake, most probably the neighboring house would also be affected. This property is called spatial autocorrelation.

3) Spatial autocorrelation among features
4) Observations that are not independent

Traditionally, statistics can be applied only when the data are independent of their neighbors. Hence the term geo-statistics is normally applied to spatial data, as spatial statistics is often associated with a discrete space.

Data input

The inputs for spatial data mining algorithms are more complex, as they include extended objects like points, lines, polygons etc.

The data input also includes spatial attributes and non-spatial attributes. The non-spatial attributes include information like name, population, disease type etc.

Spatial queries

Spatial queries are queries like:

"Find all houses near the lake."

"Find the regions affected by fire."

Spatial queries, unlike traditional queries, do not use arithmetic operators like <, > etc. Instead, they use operators like near, contained in, overlap etc. Thus the relationships are spatial in nature.


More often, spatial queries are categorized as region or range queries asking for objects within a region, nearest-neighbor queries to identify the closest objects, and distance scans to identify objects within a certain distance; a small sketch of the latter two follows.
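The following Python sketch is purely illustrative (the objects and coordinates are hypothetical, and plain Euclidean distance stands in for a real spatial index):

import math

# spatial objects: name -> (x, y) location
houses = {
    "h1": (1.0, 1.0),
    "h2": (2.0, 1.5),
    "h3": (8.0, 7.0),
}
lake = (1.5, 1.0)

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# nearest-neighbor query: the house closest to the lake
nearest = min(houses, key=lambda h: distance(houses[h], lake))

# distance scan: all houses within 2 units of the lake ("near the lake")
near = [h for h, loc in houses.items() if distance(loc, lake) <= 2.0]

print(nearest, near)   # h1 is nearest; h1 and h2 are near the lake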

The relationships that are often covered by spatial queries are shown in Table 5.1.

Table 5.1 Spatial Operations

Disjoint: Region A is disjoint from region B if they have no common points.
Overlaps or Intersects: At least one common point exists between regions A and B.
Equals: Regions A and B are identical.
Covered by (Inside, Contained): Region A is covered by region B if all the points of A are covered by B.
Covers (Contains): Region B covers (contains) region A if and only if A is covered by B.

Data Mining Tasks for spatial Data

• Spatial OLAP
• Association rules
• Classification / trend analysis
• Spatial clustering methods

Spatial OLAP

The kinds of dimensions present in spatial data include:

Non-spatial dimension

This represents non-spatial data.

Spatial-to-non-spatial dimension

This tackles data at two levels: the base data is spatial, but the generalizations are not spatial; they are non-spatial in nature.

Spatial-to-spatial dimension

This dimension includes data which are spatial both at the primitive level and at higher levels.

The measures of the spatial data cube can be:

Numeric: the measure contains only numeric data.

Spatial: the measure may be a set of pointers pointing to the spatial objects.


Once the cube is constructed, queries can be answered just as in the non-spatial case.

Mining Association rules

Association rule mining can be applied to spatial data. The extracted association rules are of the form

A => B (s%, c%), where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule and c% is the confidence of the rule.

Spatial associations occur at different levels. For example, the following is an association rule:

is_a(x, 'school') ^ close_to(x, 'big playground') => close_to(x, 'city') (50%, 50%)

Association mining can also identify groups of particular features that frequently appear close to each other; this is called mining of spatial co-locations.

Spatial Clustering methods

Spatial clustering is the process of grouping spatial objects into clusters so that the spatial objects clustered together are highly similar to each other and dissimilar to the objects in other clusters. All the classical clustering algorithms can be used to cluster spatial data, as the sketch below illustrates.
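For instance, a classical algorithm such as k-means can be applied directly to the coordinates of the spatial objects. The Python sketch below is illustrative only; the points are hypothetical and scikit-learn is assumed to be available:

from sklearn.cluster import KMeans

# (x, y) locations of spatial objects, e.g. houses in two neighborhoods
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]]

# group the objects into two spatial clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)   # nearby objects share a cluster label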

Spatial Classification algorithm

Spatial classification analyzes spatial objects to derive classification schemes based on spatial properties such as neighborhood.

This requires the identification of spatially related factors; by performing relevance analysis, the best attributes can be selected. Then traditional classification algorithms like decision trees can be used to classify the spatial data.

Spatial trend analysis can also be applied to detect changes and trends along a spatial dimension. This extracts trends of spatial or non-spatial data changing with space.

Sometimes both time and space change; traffic flow is an example. Spatio-temporal classification schemes can be built for these sorts of data.


5.6 WWW MINING

Currently the Web is the largest data source. The Web has many characteristics that make mining it a challenging task. Some of these characteristics are listed below:

1) The amount of data present on the Web is huge.
2) The Web is a repository where all kinds of data are present, ranging from structured tables to semi-structured web pages and unstructured text and multimedia files.
3) The data are heterogeneous in nature and a significant amount of the information is linked. A page that is referred to by many other pages is called an authoritative page.
4) Web data contain noise, since much of the data is unauthentic in nature; anyone can put any information on the net.
5) The Web is dynamic and its content changes constantly. Changing content and the management of dynamic data are the biggest concerns.

Web mining aims to discover hidden information or knowledge from web data. Web data include the web hyperlink structure, web page content and web usage data. Based on these, web mining tasks can be categorized into three categories.

Web Structure Mining

Web pages are connected by links (hyperlinks). These hyperlinks can be mined to get useful information, such as identifying important web pages. Traditional algorithms are not helpful here, since a normal relational table has no link structure.

Web Content Mining

Web content mining tasks mine web page contents to get useful information or knowledge. It is useful for clustering similar web pages and for the classification of web pages. Typical applications include customer profiling etc.

Web Usage Mining

Web usage mining refers to the process of mining user logs to discover user access patterns.

Web mining is similar to data mining. In traditional data mining, data collection is a major task, and in web mining too, data collection can be a substantial task. Once the data is collected, it can be preprocessed, and then mining algorithms can be applied to it.


Web Structure Mining

This represents the analysis of the link structure of the web. Website X, which contains document A, may have a hyperlink to another document B of website Y. This means that document B is considered useful by the author of document A.

HITS (Hyperlink Induced Topic Search) is a common algorithm for finding documents related to a given topic; in particular, it aims to find the authoritative pages for that topic.

The algorithm accepts a set of web page references as input, called the seed set S; typically it contains around 200 links. The HITS algorithm adds more references to this set to expand it to a set T, called the target set. The web links are measured as weights: if a page contains a reference to an authoritative site, or if an authoritative site has links to the page, then that page is weighted more. The outgoing links determine the weight of a hub.

1) Accept the set S. Let p be a page of the set S.
2) Initialize the hub weight to 1 for each page of the set S.
3) Initialize the authority weight to 1 for each page of the set S.
4) Let the expression p -> q represent "page p has a hyperlink to web page q".
5) Update the authority weight and the hub weight for each page p of the set S:

   authority-weight(p) = sum of hub-weight(q) over all q such that q -> p

   hub-weight(p) = sum of authority-weight(q) over all q such that p -> q

6) Exit.
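A compact Python sketch of these update steps is given below. The tiny link graph is made up; a fixed number of iterations with simple normalization is used here, whereas a practical implementation would iterate until convergence:

# hyperlink structure: page -> set of pages it links to (p -> q)
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
pages = links.keys()

hub = {p: 1.0 for p in pages}          # step 2: hub weights initialized to 1
auth = {p: 1.0 for p in pages}         # step 3: authority weights initialized to 1

for _ in range(20):                    # step 5, repeated a fixed number of times
    # authority-weight(p) = sum of hub-weight(q) over all q -> p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub-weight(p) = sum of authority-weight(q) over all p -> q
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the weights do not grow without bound
    a_norm = sum(auth.values()); h_norm = sum(hub.values())
    auth = {p: w / a_norm for p, w in auth.items()}
    hub = {p: w / h_norm for p, w in hub.items()}

print(max(auth, key=auth.get))         # C receives the most incoming hub weight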

Some problems associated with this algorithm are that it does not take automatically generated hyperlinks into account and it does not exclude irrelevant or less relevant documents. The algorithm also has problems like topic hijacking, where many hyperlinks point to the same web page, and topic drift, where the algorithm fails to concentrate on the specific topic mentioned by the query.

Page-rank: Web graph

The semantic structure of the web can also be constructed based on page-to-block and block-to-page relationships.

The block-to-page relationship captures the links from the several semantic blocks present in a web page to other pages. If Z represents the block-to-page matrix, then


Z(i,j) = 1/s(i), if there is a link from block i to page j
       = 0, otherwise

where s(i) is the number of pages to which block i links.

The page-to-block relationship is defined as

X(i,j) = f(b(j)), if block b(j) belongs to page p(i)
       = 0, otherwise

f is a function that assigns an importance value to every block 'b' in the page 'p'. The function is empirically defined as the ratio between the size of the block b and the distance between the center of b and the center of the screen, multiplied by a normalization factor that makes the sum of f_p(b) over all blocks of a page equal to 1. Based on the values of X and Z, the web page graph can be constructed as

Web page graph = X × Z
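A tiny NumPy sketch of this construction is given below; the matrices are invented for illustration. X is the page-to-block matrix, Z is the block-to-page matrix, and their product gives a page-to-page web graph:

import numpy as np

# X: 2 pages x 3 blocks, X[i, j] = importance of block j within page i
X = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.4, 0.6]])

# Z: 3 blocks x 2 pages, Z[i, j] = 1/s(i) if block i links to page j
Z = np.array([[0.0, 1.0],
              [0.5, 0.5],
              [1.0, 0.0]])

web_graph = X @ Z          # page-to-page weight matrix
print(web_graph)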

Web Content Mining

Here patterns are extracted from online sources such as HTML files, text documents, e-mail messages etc. For example, summarization of a web page is a data mining task. There are two approaches being explored here.

Local Knowledge-base model

In this model, web pages are collected and then categorized; for example, the categories may be games or education. References to many web pages are collected under each category. Based on the query, a category is first selected and then the search for the web page is performed within that category.

Agent based model

The agents get the requirements of the user and then use artificial intelligence concepts to discover and organize the documents. The techniques range from user profiling to customized or personalized web agents.

Web Usage Mining

Web usage mining analyzes the behavior of customers. Basically, web log analysis is carried out for web usage mining. The discovered access patterns are used by organizations to devise strategies for various marketing applications; a small sketch of the starting point of such an analysis follows.
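As a rough illustration (the log entries and their format are hypothetical), the Python sketch below counts how often each user accesses each page, which is the raw material for access-pattern mining:

from collections import Counter

# hypothetical web server log entries: (user/ip, requested page)
log = [
    ("10.0.0.1", "/home"),
    ("10.0.0.1", "/products"),
    ("10.0.0.2", "/home"),
    ("10.0.0.1", "/products"),
    ("10.0.0.2", "/support"),
]

page_hits = Counter(page for _, page in log)     # overall page popularity
user_paths = Counter(log)                        # (user, page) access counts

print(page_hits.most_common(2))                  # most visited pages
print(user_paths[("10.0.0.1", "/products")])     # 2 accesses by this user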

5.7 DISTRIBUTED DATA MINING

Distributed computing plays an important role in the data mining process. This is due to the advances in computing and communication over wired and wireless networks. Data mining and knowledge discovery in large amounts of data can benefit from the use of parallel and distributed computational environments. This includes different distributed sources of voluminous data, multiple compute nodes, and a distributed user community. The field of distributed data mining (DDM) deals with this problem—mining distributed data by paying careful attention to the distributed resources.

The need for distributed data mining (DDM) arises because data mining requires a huge amount of data, so there is a need to distribute the data to provide a scalable system. Also, the data is often inherently distributed across different databases.

Distributed data mining uses many technologies for distributed association mining, rule mining and clustering. Some popular implementations are for distributed decision-tree construction, which involves the approaches mentioned here:

• Synchronous tree construction (data parallelization)
  – In this model there is no need for movement of data, but the approach incurs a high communication cost as the tree becomes bushy.
• Partitioned tree construction (task parallelization)
  – In this model the processors work independently once the tree is completely partitioned, but the model involves load imbalance and a high cost of data movement.
• Hybrid algorithm
  – This method combines the good features of the two approaches, adapting dynamically according to the size and shape of the trees.

Often distributed data mining uses a host of technologies to mine the data. There is no standardized architecture for distributed data mining. For simplicity's sake, an open source architecture called JAM (Java Agents for Meta-learning) can be mentioned here for better understanding. This architecture uses agents to support DDM. Agents are mobile carriers: they are used to generate and transport the trained classifiers, while meta-learning combines these classifiers. This improves efficiency, as the classifiers are generated at the different data sites in parallel, each computed over its local, stationary data set.
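The sketch below is a simplified, non-agent-based illustration of that idea (scikit-learn and the synthetic data are assumptions, not part of JAM itself): a classifier is trained on each site's local data in isolation, and the local predictions are then combined by majority voting, a basic form of meta-learning.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_site_data(n=200):
    # synthetic local data set: label is 1 when the sum of the features is positive
    X = rng.normal(size=(n, 4))
    y = (X.sum(axis=1) > 0).astype(int)
    return X, y

# one classifier is trained per site; in a real DDM system this happens in parallel
site_models = []
for _ in range(3):
    X, y = make_site_data()
    site_models.append(DecisionTreeClassifier(max_depth=4).fit(X, y))

def combined_predict(X_new):
    votes = np.array([m.predict(X_new) for m in site_models])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote of the site classifiers

X_test, y_test = make_site_data(50)
print((combined_predict(X_test) == y_test).mean())   # accuracy of the combined classifier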


5.8 TOOLS FOR DATA MINING AND CASE STUDIES

5.8.1 Data Mining OLE DB Miner

A data mining component is included in SQL Server 2000/2005. OLE DB for Data Mining is also becoming a standard for data mining applications.

SQL Server 2000 introduced data mining features for the first time. Initially two algorithms were introduced: Microsoft Decision Trees and Microsoft Clustering. The data mining components became part of Microsoft Analysis Services.


Microsoft Analysis Services has two components:

1. OLAP services
2. Data mining

OLAP and data mining are complementary to each other.

What is the need for OLE DB for Data Mining?

Existing packages have many problems:

1. Each package has its own notion of a mining model, so the packages do not communicate with each other; package A cannot easily interact with package B.

2. Most packages are horizontal packages, so there is a problem of integration with user applications.

3. Most data mining products extract data and store it in an intermediate store. Data porting and transformation is a difficult and expensive operation; instead, data mining could be applied directly on the stored data.

Microsoft aims to remove the above problems and aims at standardization: to provide an industry standard so that data mining algorithms can be easily plugged in, providing a common interface between data mining consumers and data mining providers.

The basic architecture of the OLE DB miner is shown in Figure 5.2 below; Figure 5.1 shows the distributed data mining framework discussed in the previous section.

Figure 5.1 Distributed Data Mining Framework (each site runs a DDM interface, a mining engine and a client/server configuration manager with a local repository of classifiers, all connected over a distributed data mining network)


Microsoft has implemented an OLE DB provider based on the OLE DB for Data Mining specification. The data mining provider is part of Analysis Services. There is no sophisticated GUI, but there are many wizards available as part of Analysis Services to help with data mining (refer to Figure 5.2).

Figure 5.2 Architecture of OLE DB Miner.

Microsoft has implemented two algorithms as part of SQL Server 2000. Both the classification and the clustering algorithms are highly scalable and suitable for large data sets.

One of the biggest advantages is that it is based on OLE DB for Data Mining. Suppose a college wants to mine student data to get an insight into user requirements; all it has to do is to include the data mining algorithms in its Student Information System (Figure 5.3).

Figure 5.3 OLE DB Analysis

(Figure components: Consumer – OLE DB (API) – Data Mining Provider – OLE DB – Data Source; the Analysis Server provides the DM Wizard, DM Editor and DM Browser.)


The developer can create mining models using VB or C++ or using the wizards of the Analysis Manager. The wizards generate data mining models and then issue textual queries similar to database queries.

Mining models are like containers: they do not store data but instead use the data directly from the RDBMS created using SQL Server. A mining model looks like:

Figure 5.4: Student Information System

CREATE MINING MODEL Student

{

Id long key,

Name text discrete,

Age long continuous

}

USING [Microsoft Decision Tree]

Once a model is created, the algorithm analyzes the input data. Any tabular data source can be input to the model provided there is an OLE DB driver. To be consistent, SQL Server provides syntax similar to SQL, including commands like OPENROWSET to access data from a remote site. Data does not have to be loaded ahead of time; this service is called in-place mining. After training is over, the data mining algorithms can be applied, and once mining returns the patterns, the user can browse the mining model to look at the discovered patterns.

5.8.2 WEKA Introduction With Case Studies

There is no single suite which provides all the data mining tasks; users must sometimes use different suites for their requirements.

The WEKA workbench is a collection of machine learning algorithms for data mining. It is designed in such a manner that users can quickly explore data mining algorithms over their datasets.



The major advantages of the WEKA system are:

1. It is open source software. Hence it is free but rich in features, and it is maintainable and modifiable.

2. It provides many algorithms so that the user can easily explore, experiment with and compare classifiers.

3. It is developed in Java, so it is truly portable and can run on any machine.
4. It is easy to use.

The main disadvantage of WEKA is:

All algorithms are main-memory based. This limitation prevents the usage of WEKA for larger datasets; for larger datasets, some sort of sub-sampling should be used.

Exploring WEKA:

1. Preprocessing:

WEKA requires input in CSV format or its native ARFF file format. Database access is provided using JDBC, so data can be accessed from any database using SQL. WEKA provides many filters to preprocess the data.

2. Classify
3. Cluster
4. Associate
5. Select attributes
6. Visualize

WEKA provides a knowledge flow interface for specifying a data stream by graphically connecting components representing data sources, preprocessing tools, learning algorithms, evaluation and visualization tools.

WEKA provides an Experimenter component to run and compare different classification and regression algorithms with different parameter values. It also facilitates distributing the load across many machines using Java RMI.

Methods, Algorithms/Architecture:

WEKA provides a comprehensive set of useful algorithms covering a wide range of tasks: filters, clustering, classification, association rule learning, and regression. To make operations as flexible as possible, WEKA is designed with a modular, object-oriented architecture, so any new algorithm can be included easily.


The WEKA implementation has a top-level package called "core", which provides the global data structures: classes, instances and attributes. Each data mining task is available as a sub-package, such as classifiers, clusterers etc.

This whole set of features makes WEKA an attractive, open source means to explore the benefits of data mining.

Case Study 1

Weka is effective in solving data mining problems. This case study aims to demonstrate the use of Weka. The data set chosen for the demonstration is the weather data.

The weather data is nominal and has 14 records. The attributes of the table are given below.

Instances: 14
Attributes: 5

outlook temperature humidity windy play

The aim is to decide whether the kid can play or not.

The first step is the collection of data. Weka requires that the data be given in a format called ARFF. ARFF stands for Attribute-Relation File Format.

The ARFF file of this database is given below.

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data

sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes


rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

The file first specifies the attribute list followed by the actual data.

Weka provides rich data mining functionality and a GUI using which the user can select the required tasks.

For demonstration's sake, the ID3 algorithm is selected. When applied, Weka produces the classification model given below.

=== Classifier model (full training set) ===

Id3
outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = TRUE: no
|  windy = FALSE: yes

This is the decision tree produced by Weka. The terminal nodes are classes; the root and internal nodes test the attributes.

It can be seen that the time taken to build the model is only 0.01 seconds. The quality of the classification model is given by the confusion matrix, which can be analyzed using the metrics discussed in the earlier sections.

=== Confusion Matrix ===

  a b   <-- classified as
  8 1 | a = yes
  1 4 | b = no


The stratified cross-validation yields the following summary:

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          12               85.7143 %
Incorrectly Classified Instances         2               14.2857 %
Kappa statistic                          0.6889
Mean absolute error                      0.1429
Root mean squared error                  0.378
Relative absolute error                 30      %
Root relative squared error             76.6097 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.889     0.2       0.889       0.889    0.889       0.844      yes
0.8       0.111     0.8         0.8      0.8         0.844      no
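As a check, the summary figures above can be recomputed directly from the confusion matrix. The short Python sketch below, written for this 2x2 case, derives the accuracy and the kappa statistic:

# confusion matrix from the ID3 run: rows = actual, columns = predicted
cm = [[8, 1],    # actual yes: 8 classified yes, 1 classified no
      [1, 4]]    # actual no : 1 classified yes, 4 classified no

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total             # 12/14, about 0.8571

# expected agreement by chance, from the row and column totals
row = [sum(r) for r in cm]
col = [cm[0][j] + cm[1][j] for j in range(2)]
expected = sum(row[i] * col[i] for i in range(2)) / total ** 2

kappa = (accuracy - expected) / (1 - expected)       # about 0.6889
print(accuracy, kappa)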

We can choose another algorithm, J4.8 of Weka (a Java implementation of C4.5), which uses information gain (refer to Chapter 3).

The trace of the algorithm is shown below. The same kind of analysis can be made for this result also.

J48 pruned tree
------------------
outlook = sunny
|  humidity = high: no (3.0)
|  humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|  windy = TRUE: no (2.0)
|  windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===
=== Summary ===


Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.0426
Mean absolute error                      0.4167
Root mean squared error                  0.5984
Relative absolute error                 87.5    %
Root relative squared error            121.2987 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.556     0.6       0.625       0.556    0.588       0.633      yes
0.4       0.444     0.333       0.4      0.364       0.633      no

=== Confusion Matrix ===

  a b   <-- classified as
  5 4 | a = yes
  3 2 | b = no

The following is the result of association rule mining on the same data. The association rule mining algorithm tries to associate attribute values to generate rules.

Apriori algorithm (Refer chapter 2) is used to generate the following association rules.

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

1. outlook=overcast 4 ==> play=yes 4    conf:(1)
2. temperature=cool 4 ==> humidity=normal 4    conf:(1)


3. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3    conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2    conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2    conf:(1)

We can notice that every rule is reported with its confidence factor.
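These figures can be verified by hand on the 14-record weather data. For example, for rule 1 (outlook=overcast ==> play=yes), the Python sketch below recomputes the support and confidence; the record list simply mirrors the ARFF data shown earlier:

# (outlook, play) pairs from the 14 weather records
records = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

antecedent = [r for r in records if r[0] == "overcast"]   # outlook=overcast: 4 records
both = [r for r in antecedent if r[1] == "yes"]           # ... and play=yes: 4 records

support = len(both) / len(records)                        # 4/14, about 0.29
confidence = len(both) / len(antecedent)                  # 4/4 = 1.0
print(support, confidence)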

Getting a good quality data set is a difficult task, so for testing algorithms it is better to generate random (synthetic) datasets.

% Commandline
% weka.datagenerators.classifiers.classification.Agrawal -r weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05 -S 1 -n 100 -F 1 -P 0.05

@relation weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05

@attribute salary numeric
@attribute commission numeric
@attribute age numeric
@attribute elevel {0,1,2,3,4}
@attribute car {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}
@attribute zipcode {0,1,2,3,4,5,6,7,8}
@attribute hvalue numeric
@attribute hyears numeric
@attribute loan numeric
@attribute group {0,1}

@data

110499.735409,0,54,3,15,4,135000,30,354724.18253,1
140893.779095,0,44,4,20,7,135000,2,395015.33902,1
119159.651677,0,49,2,1,3,135000,22,122025.085242,1
20000,52593.636537,56,0,9,1,135000,30,99629.621457,1

=== Run information ===


Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05
Instances:    100
Attributes:   10
              salary
              commission
              age
              elevel
              car
              zipcode
              hvalue
              hyears
              loan
              group
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

age <= 37: 0 (31.0)
age > 37
|  age <= 62: 1 (39.0/4.0)
|  age > 62: 0 (30.0)

Number of Leaves : 3

Size of the tree : 5

Time taken to build model: 0.14 seconds

=== Stratified cross-validation ===
=== Summary ===


Relative absolute error                 21.4022 %
Root relative squared error             53.9683 %
Total Number of Instances              100

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.938     0.086     0.953       0.938    0.946       0.912      0
0.914     0.062     0.889       0.914    0.901       0.912      1

=== Confusion Matrix ===

  a  b   <-- classified as
 61  4 | a = 0
  3 32 | b = 1

The results of the 1R (OneR) algorithm (refer to Chapter 3) for a random data set are shown below.

=== Run information ===

Scheme:       weka.classifiers.rules.OneR -B 6
Relation:     weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-c_2_-N_0_-I_0_-M_1_-R_10
Instances:    100
Attributes:   11
              a0
              a1
              a2
              a3
              a4
              a5
              a6
              a7
              a8
              a9
              class

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===

a5:
	false -> c0
	true  -> c1

(74/100 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances          74               74      %
Incorrectly Classified Instances        26               26      %
Kappa statistic                          0.4992
Mean absolute error                      0.26
Root mean squared error                  0.5099
Relative absolute error                 57.722  %
Root relative squared error            107.5058 %
Total Number of Instances              100

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.636     0.059     0.955       0.636    0.764       0.789      c0
0.941     0.364     0.571       0.941    0.711       0.789      c1

=== Confusion Matrix ===

  a  b   <-- classified as
 42 24 | a = c0
  2 32 | b = c1

=== Run information ===

Scheme:       weka.classifiers.trees.Id3
Relation:     weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-c_2_-N_0_-I_0_-M_1_-R_10
Instances:    100
Attributes:   11

              a0
              a1
              a2
              a3
              a4


              a5
              a6
              a7
              a8
              a9
              class

Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Id3
a5 = false
|  a1 = false: c0
|  a1 = true
|  |  a8 = false: c0
|  |  a8 = true
|  |  |  a0 = false: c0
|  |  |  a0 = true
|  |  |  |  a2 = false: c1
|  |  |  |  a2 = true
|  |  |  |  |  a4 = false: c1
|  |  |  |  |  a4 = true: c0
a5 = true
|  a8 = false
|  |  a9 = false
|  |  |  a2 = false
|  |  |  |  a3 = false: c0
|  |  |  |  a3 = true
|  |  |  |  |  a1 = false: c0
|  |  |  |  |  a1 = true: c1
|  |  |  a2 = true
|  |  |  |  a0 = false
|  |  |  |  |  a4 = false: c1
|  |  |  |  |  a4 = true: c0
|  |  |  |  a0 = true: c1
|  |  a9 = true
|  |  |  a3 = false
|  |  |  |  a0 = false: c1
|  |  |  |  a0 = true
|  |  |  |  |  a7 = false: c0
|  |  |  |  |  a7 = true: c1
|  |  |  a3 = true: c1


|  a8 = true
|  |  a1 = false
|  |  |  a2 = false
|  |  |  |  a0 = false
|  |  |  |  |  a3 = false: c0
|  |  |  |  |  a3 = true: c1
|  |  |  |  a0 = true: c0
|  |  |  a2 = true
|  |  |  |  a4 = false: c1
|  |  |  |  a4 = true: c0
|  |  a1 = true
|  |  |  a7 = false: c0
|  |  |  a7 = true
|  |  |  |  a0 = false
|  |  |  |  |  a2 = false: c1
|  |  |  |  |  a2 = true: c0
|  |  |  |  a0 = true: c0

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          78               78      %
Incorrectly Classified Instances        22               22      %
Kappa statistic                          0.5234
Mean absolute error                      0.22
Root mean squared error                  0.469
Relative absolute error                 48.8417 %
Root relative squared error             98.891  %
Total Number of Instances              100

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.803     0.265     0.855       0.803    0.828       0.769      c0
0.735     0.197     0.658       0.735    0.694       0.769      c1

=== Confusion Matrix ===

  a  b   <-- classified as
 53 13 | a = c0
  9 25 | b = c1


The results of the Apriori algorithm (refer to Chapter 3) on the above data set using Weka are shown below.

=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1

Relation: weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-c_2_-N_0_-I_0_-M_1_-R_10

Instances:    100
Attributes:   11
              a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 class

=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.2 (20 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 22
Size of set of large itemsets L(2): 182
Size of set of large itemsets L(3): 56

Best rules found:

1. a1=false a5=false 24 ==> class=c0 24    conf:(1)
2. a5=false a8=false 24 ==> class=c0 24    conf:(1)
3. a5=false a6=false 23 ==> class=c0 23    conf:(1)
4. a8=false class=c1 22 ==> a5=true 22    conf:(1)
5. a5=false a7=true 21 ==> class=c0 21    conf:(1)


6. a5=false a9=false 21 ==> class=c0 21    conf:(1)
7. a3=false a5=false 20 ==> class=c0 20    conf:(1)
8. a6=false class=c1 20 ==> a5=true 20    conf:(1)
9. a2=false a5=false 27 ==> class=c0 26    conf:(0.96)
10. a4=false a5=false 23 ==> class=c0 22    conf:(0.96)

5.8.3 Selection of Data Mining Tool

One of the major decisions a business organization should exercise is the selection of a suitable data mining tool for its requirements. The best tool need not be the most advanced one. Some of the requirements business organizations should look into when selecting tools are listed below.

1. Data types: The data types can range from record-based relational data to specialized data like spatial, stream, time-series or web data. The company should have a clear idea about the kind of data it will be dealing with; no single suite will support all data types.

2. System issues: Issues like the operating system, machine requirements and interfaces like XML play an important role in selection.

3. Data sources
4. Ease of use: Many data mining tasks can be performed by a plain programmer instead of an expert statistician. Modern data mining tools should have a suitable GUI to ease the use of the tool, thereby shortening the learning curve.

A good GUI is required to perform user-guided, high-quality interactive data mining. Lack of standards is a primary issue; the requirements of business organizations may force them to make use of the functionalities of many data mining suites.

1. Visualization tools

It is better to have visualization capability. Business organizations deal with terabytes of data, and exploration of this amount of data is plainly impossible without visual aids. So visualization is required for visualizing the data, the results and the process. The quality and flexibility of the visualization tools should be evaluated by business organizations when selecting a suitable suite.

2. Accuracy

The accuracy of the data mining tool is important. A good tool with an acceptable level of accuracy normally influences an organization's selection of tools.

3. Common tasks

The data mining suite's capability to perform many data mining tasks should be evaluated. The requirements of organizations vary: some organizations are interested in OLAP analysis and association mining, while others may be interested in prediction and trend analysis.


It is impossible for any data mining suite to provide all facilities. Sometimes business organizations need to perform multiple tasks and may wish to integrate them; this provides flexibility to the organization.

Some of the data mining tasks require a data warehouse and some may not. Hence, it is up to the user to explore the data mining functionalities to decide upon the suitability of the data mining suite for the business organization.

4. Scalability

Data mining has scalability issues. Some tools provide only in-memory algorithms, which makes the application of such suites questionable, especially when organizations have very large datasets. Hence this should be a major criterion of tool selection for organizations.

These are some of the features that need to be considered, along with the requirements of the organization, to judge the suitability of a suite. Some of the commercial suites commonly used by organizations are Microsoft SQL Server, IBM Intelligent Miner, MineSet, Oracle Data Mining, Clementine and Enterprise Miner. Some of the tools, like WEKA, are open source projects.

Summary

• Data mining is used in various domains like business, telecommunication, marketing, web analysis, medicine, security and scientific domains.
• One of the major issues of data mining is the privacy and confidentiality of private data.
• Text mining mines the text present in large collections of documents.
• The basic measures of text retrieval are precision, recall and F-score.
• Text mining includes keyword-based association analysis, document classification and document clustering.
• Text indexing techniques include the inverted index and the signature file.
• Spatial mining is the process of discovering hidden but potentially useful patterns from a large set of spatial data.
• Spatial mining includes spatial OLAP, association rules, classification, trend analysis and spatial clustering methods.
• Web mining includes web structure mining, web content mining and web usage mining.
• There is no single data mining suite which provides all the data mining tasks. Some of the popular tool suites are WEKA and Microsoft OLE DB for Data Mining.
• Some of the criteria for choosing data mining suites are the ability to handle many data types, ease of use, visualization capability, accuracy and data mining functionalities.


DID YOU KNOW?

1. What is the difference between ubiquitous data mining and traditional data mining applications?
2. What is meant by private data?
3. Guarding the privacy of the data is a difficult task for data mining algorithms. Justify.

Short Questions

1. Enumerate some of the applications of data mining.
2. What is the role of the "Fair Information Report"?
3. What are the measures of text retrieval?
4. What is meant by the vector-space model?
5. Enumerate some of the spatial queries.
6. Enumerate some of the text mining tasks.
7. What are the kinds of web mining?
8. What are the salient features of a data mining suite?
9. What are the social implications of data mining?
10. Enumerate some of the issues of data mining.

LONG QUESTIONS

1. What are the major issues that confront data mining? Explain in detail.
2. Explain in detail spatial data mining.
3. Explain the differences between text mining and text retrieval.
4. Explain in detail text mining tasks.
5. Explain in detail the HITS algorithm of page ranking.
6. Explain the criteria for selecting a data mining suite.
