Data Mining

Data Mining

Data for Data MiningByProf. Muhammad Amir Alam

Data for Data Mining

Data for data mining comes in many forms: from computer files typed in by human operators, business information in SQL or some other standard database format, information recorded automatically by equipment such as fault logging devices, to streams of binary data transmitted from satellites.

The universe of objects is normally very large and we have only a small part of it. Usually we want to extract information from the data available to us that we hope is applicable to the large volume of data that we have not yet seen.

Standard Formulation

We will assume that for any data mining application we have a universe of objects that are of interest.For exampleall human beings alive or deadpossibly all the patients at a hospitalall the rocks on the moonall the pages stored in the World Wide Web

Example(DataSet)

SoftEng ARIN HCI CSA Project Class

B A A B A Second

A B A B A First

B A B A B Second

C A A A A First

A A A A A First

A A A A B Second

…………… -------------- -------------- -------------- -------------- ----------

A B A B A First

The set of variable values corresponding to each of the objects is called arecord or (more commonly) an instance.This dataset is an example of labelled data, where one attribute is givenspecial significance and the aim is to predict its value.When there is no such significant attribute we call the data unlabelled.

Types of VariableSix main types of variable can be distinguished.

Nominal Variables:A variable used to put objects into categories, e.g. the name or colour of an object. A nominal variable may be numerical in form, but the numerical valueshave no mathematical interpretation. For example we might label 10 peopleas numbers 1, 2, 3, . . . , 10, but any arithmetic with such values

Binary Variables:A binary variable is a special case of a nominal variable that takes only two possible values: true or false, 1 or 0 etc.

Ordinal VariablesOrdinal variables are similar to nominal variables, except that an ordinal variable has values that can be arranged in a meaningful order, e.g. small, medium,large.Integer VariablesInteger variables are ones that take values that are genuine integers, for example‘number of children’. Unlike nominal variables that are numerical in form, arithmetic with integer variables is meaningful (1 child + 2 children = 3 children etc.).Interval-scaled VariablesInterval-scaled variables are variables that take numerical values which are measured at equal intervals from a zero point or origin.Ratio-scaled VariablesRatio-scaled variables are similar to interval-scaled variables except that the zero point does reflect the absence of the measured characteristic

Data Preparation

Data Cleaning:when the data is in the standard form it cannot be assumed that itis error free. becauseMeasurement errorsSubjective judgmentsMalfunctioningOr misuse of automatic recording equipmentFor example:The number 69.72 may accidentally be entered as 6.972Or“bbrown” for “brown” but this type of errors are less harmonious

Missing ValuesIn many real-world datasets data values are not recorded for all attributes. This can happen for various reasonsa malfunction of the equipment used to record the dataa data collection form to which additional fields were

added after some data had been collected information that could not be obtained, e.g. about a

hospital patient

There are many techniques to recover missing values:Two of them are:1. Discard Instances2. Replace by most important/ frequent / average values

Deal with missing valuesDiscard Instances This is the simplest strategy: delete all instances where there is at least one missing value and use the remainder.

Replace by Most Frequent/Average Value A straightforward but effective way of doing this for a categorical attribute is to use its most frequently occurring (non-missing) value. This is easy to justify if the attribute values are very unbalanced. For example if attribute X has possible values a, b and c which occur in proportions 80%, 15% and 5% respectively, it seems reasonable to estimate any missing values of attribute X by the value a.

Documents

Data Mining