12
Data Mining Getting to know your data Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 12

Data Mining - Sharif

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Mining - Sharif

Data MiningGetting to know your data

Hamid Beigy

Sharif University of Technology

Fall 1394

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 12

Page 2: Data Mining - Sharif

Table of contents

1 Introduction

2 Getting to know your data

3 Statistical description of data

4 Data visualization

5 Reading

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 2 / 12

Page 3: Data Mining - Sharif

Data mining process

A typical knowledge discovery process is

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 7 #7

1.2 What Is Data Mining? 7

Flat filesDatabases

DataWarehouse

Patterns

Knowledge

Cleaning andintegration

Selection andtransformation

Datamining

Evaluation andpresentation

Figure 1.4 Data mining as a step in the process of knowledge discovery.Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 3 / 12

Page 4: Data Mining - Sharif

Getting to know your data

Real-world data are typically noisy, enormous in volume, and may originate fromheterogenous sources.

The first step of data mining is to know the data. We need to know

What are the type of attributes or fields that make up the data?What kind of values does each attribute have?Which attributes are discrete and which are continuous-valued?How are the values distributed?Are the ways we can visualize the data to get a better sense of it?Can we spot any outlier?Can we measure the similarity of some data objects with respect to others?

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 4 / 12

Page 5: Data Mining - Sharif

Attribute types

Nominal attributes The values of nominal attributes are symboles or name of things. Eachvalue represents some kind of category, code, or state, and nominal attributes arealso referred as categorical. The values does not have any meaningful order.

Binary attributes A binary attribute is a nominal attribute with only two categories.

A binary attribute is symmetric if both of its states are equally valuable andcarry the same weight.A binary attribute is asymmetric if the outcomes of the states are notequally important.

Ordinal attributes An ordinal attribute is an attribute with possible values that have ameaningful order or ranking among them, but the magnitude between successivevalues is not known.

Numerical attributes A numeric attribute is quantitative; that is, it is a measurable quantity,represented in integer or real values. Numeric attributes can be interval-scaled orratio-scaled.

Interval-scaled attributes are measured on a scale of equal-size units. Theirvalues have order. We can compare and quantify the difference betweenvalues. Temperatures in Celsius(Fahrenheit) do not have a true zero-point.A ratio-scaled attribute is a numeric attribute with an inherent zero-point.Their values have order and allow us to compare and quantify the differencebetween values. Temperatures in the Kelvin has a true zero-point.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 5 / 12

Page 6: Data Mining - Sharif

Statistical description of data

For data preprocessing to be successful, it is essential to have an overall picture of yourdata. Basic statistical descriptions can be used to identify properties of the data andhighlight which data values should be treated as noise or outliers.Three basic statistical descriptions aremeasures of central tendency This measures the location of the middle or center of a

data distribution. such as mean, median, and mode.

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 47 #9

2.2 Basic Statistical Descriptions of Data 47

lower than the median interval, freqmedian is the frequency of the median interval, andwidth is the width of the median interval.

The mode is another measure of central tendency. The mode for a set of data is thevalue that occurs most frequently in the set. Therefore, it can be determined for qualita-tive and quantitative attributes. It is possible for the greatest frequency to correspond toseveral different values, which results in more than one mode. Data sets with one, two,or three modes are respectively called unimodal, bimodal, and trimodal. In general, adata set with two or more modes is multimodal. At the other extreme, if each data valueoccurs only once, then there is no mode.

Example 2.8 Mode. The data from Example 2.6 are bimodal. The two modes are $52,000 and$70,000.

For unimodal numeric data that are moderately skewed (asymmetrical), we have thefollowing empirical relation:

mean � mode ⇡ 3 ⇥ (mean � median). (2.4)

This implies that the mode for unimodal frequency curves that are moderately skewedcan easily be approximated if the mean and median values are known.

The midrange can also be used to assess the central tendency of a numeric data set.It is the average of the largest and smallest values in the set. This measure is easy tocompute using the SQL aggregate functions, max() and min().

Example 2.9 Midrange. The midrange of the data of Example 2.6 is 30,000+110,0002 = $70,000.

In a unimodal frequency curve with perfect symmetric data distribution, the mean,median, and mode are all at the same center value, as shown in Figure 2.1(a).

Data in most real applications are not symmetric. They may instead be either posi-tively skewed, where the mode occurs at a value that is smaller than the median(Figure 2.1b), or negatively skewed, where the mode occurs at a value greater than themedian (Figure 2.1c).

Mode

Median

Mean Mode

Median

MeanMeanMedianMode

(a) Symmetric data (b) Positively skewed data (c) Negatively skewed data

Figure 2.1 Mean, median, and mode of symmetric versus positively and negatively skewed data.Measuring the data dispersion This measures how are the data spread out. The mostcommon data dispersion measures are range, quartile, interquartile range(IQR), five-numbers summary, box plots, variance, and standard deviation.

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 48 #10

48 Chapter 2 Getting to Know Your Data

2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance,Standard Deviation, and Interquartile RangeWe now look at measures to assess the dispersion or spread of numeric data. The mea-sures include range, quantiles, quartiles, percentiles, and the interquartile range. Thefive-number summary, which can be displayed as a boxplot, is useful in identifyingoutliers. Variance and standard deviation also indicate the spread of a data distribution.

Range, Quartiles, and Interquartile RangeTo start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartilerange as measures of data dispersion.

Let x1,x2, . . . ,xN be a set of observations for some numeric attribute, X . The rangeof the set is the difference between the largest (max()) and smallest (min()) values.

Suppose that the data for attribute X are sorted in increasing numeric order. Imaginethat we can pick certain data points so as to split the data distribution into equal-sizeconsecutive sets, as in Figure 2.2. These data points are called quantiles. Quantiles arepoints taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets. (We say “essentially” because there may not be data values of X thatdivide the data into exactly equal-sized subsets. For readability, we will refer to them asequal.) The kth q-quantile for a given data distribution is the value x such that at mostk/q of the data values are less than x and at most (q � k)/q of the data values are morethan x, where k is an integer such that 0 < k < q. There are q � 1 q-quantiles.

The 2-quantile is the data point dividing the lower and upper halves of the data dis-tribution. It corresponds to the median. The 4-quantiles are the three data points thatsplit the data distribution into four equal parts; each part represents one-fourth of thedata distribution. They are more commonly referred to as quartiles. The 100-quantilesare more commonly referred to as percentiles; they divide the data distribution into 100equal-sized consecutive sets. The median, quartiles, and percentiles are the most widelyused forms of quantiles.

Q2 Q3Q1

25thpercentile

75thpercentile

Median

25%

Figure 2.2 A plot of the data distribution for some attribute X . The quantiles plotted are quartiles. Thethree quartiles divide the distribution into four equal-size consecutive subsets. The secondquartile corresponds to the median.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 6 / 12

Page 7: Data Mining - Sharif

Statistical description of data (cont.)

Box plots are a popular way of visualizing a distribution. A box plot incorporates thefive-numbers summary (min, max, Q1, Q3, median).

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 50 #12

50 Chapter 2 Getting to Know Your Data

20

40

60

80

100

120

140

160

180

200

220Un

it pric

e ($)

Branch 1 Branch 4Branch 3Branch 2

Figure 2.3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a giventime period.

When dealing with a moderate number of observations, it is worthwhile to plotpotential outliers individually. To do this in a boxplot, the whiskers are extended to theextreme low and high observations only if these values are less than 1.5 ⇥ IQR beyondthe quartiles. Otherwise, the whiskers terminate at the most extreme observations occur-ring within 1.5 ⇥ IQR of the quartiles. The remaining cases are plotted individually.Boxplots can be used in the comparisons of several sets of compatible data.

Example 2.11 Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at four branches ofAllElectronics during a given time period. For branch 1, we see that the median price ofitems sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations forthis branch were plotted individually, as their values of 175 and 202 are more than 1.5times the IQR here of 40.

Boxplots can be computed in O(n logn) time. Approximate boxplots can be com-puted in linear or sublinear time depending on the quality guarantee required.

Variance and Standard DeviationVariance and standard deviation are measures of data dispersion. They indicate howspread out a data distribution is. A low standard deviation means that the data observa-tions tend to be very close to the mean, while a high standard deviation indicates thatthe data are spread out over a large range of values.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 7 / 12

Page 8: Data Mining - Sharif

Statistical description of data (cont.)

In Graphic displays of basic statistical description of data, graphs are helpful for visualinspection of data. These includes

Quantile plotsQuantilequantile plotsHistograms

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 55 #17

2.2 Basic Statistical Descriptions of Data 55

6000

5000

4000

3000

2000

1000

0

Cou

nt o

f it

ems

sold

40–59 60–79 80–99 100–119 120–139Unit price ($)

Figure 2.6 A histogram for the Table 2.1 data set.

Unit price ($)

Item

s so

ld

0

700

600

500

400

300

200

100

020 40 60 80 100 120 140

Figure 2.7 A scatter plot for the Table 2.1 data set.

(a) (b)

Figure 2.8 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.

Scatter plots.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 8 / 12

Page 9: Data Mining - Sharif

Statistical description of data (Histogram)

Plotting histograms is a graphical method for summarizing the distribution of a givenattribute X . If X is nominal, then a plot is drawn for each value of X . If X is numeric,the range of values for X is partitioned into disjoint consequitive subranges (buckets) orbins. The value of bucket (height of a bar) indicates the frequency of that X value. Theresulting graph is more commonly known as a bar chart.

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 55 #17

2.2 Basic Statistical Descriptions of Data 55

6000

5000

4000

3000

2000

1000

0

Cou

nt o

f it

ems

sold

40–59 60–79 80–99 100–119 120–139Unit price ($)

Figure 2.6 A histogram for the Table 2.1 data set.

Unit price ($)

Item

s so

ld

0

700

600

500

400

300

200

100

020 40 60 80 100 120 140

Figure 2.7 A scatter plot for the Table 2.1 data set.

(a) (b)

Figure 2.8 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 9 / 12

Page 10: Data Mining - Sharif

Statistical description of data (Scatter plots)

A scatter plot is a graphical method for determining if there appears to be a relationshipsbetween two numeric attributes (if any).

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 55 #17

2.2 Basic Statistical Descriptions of Data 55

6000

5000

4000

3000

2000

1000

0

Cou

nt o

f ite

ms

sold

40–59 60–79 80–99 100–119 120–139Unit price ($)

Figure 2.6 A histogram for the Table 2.1 data set.

Unit price ($)

Item

s so

ld

0

700

600

500

400

300

200

100

020 40 60 80 100 120 140

Figure 2.7 A scatter plot for the Table 2.1 data set.

(a) (b)

Figure 2.8 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 55 #17

2.2 Basic Statistical Descriptions of Data 55

6000

5000

4000

3000

2000

1000

0

Cou

nt o

f ite

ms

sold

40–59 60–79 80–99 100–119 120–139Unit price ($)

Figure 2.6 A histogram for the Table 2.1 data set.

Unit price ($)

Item

s so

ld

0

700

600

500

400

300

200

100

020 40 60 80 100 120 140

Figure 2.7 A scatter plot for the Table 2.1 data set.

(a) (b)

Figure 2.8 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 10 / 12

Page 11: Data Mining - Sharif

Data visualization

Data visualization aims to communicate data clearly and effectively through graphicalrepresentation.

We can take advantage of visualization techiques to discover data relationships that existbut are not easily observable by looking at the raw data.

Consider the visualization of a data set using scatter plots

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 60 #22

60 Chapter 2 Getting to Know Your Data

Figure 2.14 Visualization of a 3-D data set using a scatter plot. Source: http://upload.wikimedia.org/wikipedia/commons/c/c4/Scatter plot.jpg.

A data record is represented by a polygonal line that intersects each axis at the pointcorresponding to the associated dimension value (Figure 2.16).

A major limitation of the parallel coordinates technique is that it cannot effec-tively show a data set of many records. Even for a data set of several thousand records,visual clutter and overlap often reduce the readability of the visualization and make thepatterns hard to find.

2.3.3 Icon-Based Visualization TechniquesIcon-based visualization techniques use small icons to represent multidimensionaldata values. We look at two popular icon-based techniques: Chernoff faces and stickfigures.

Chernoff faces were introduced in 1973 by statistician Herman Chernoff. They dis-play multidimensional data of up to 18 variables (or dimensions) as a cartoon humanface (Figure 2.17). Chernoff faces help reveal trends in the data. Components of theface, such as the eyes, ears, mouth, and nose, represent values of the dimensions by theirshape, size, placement, and orientation. For example, dimensions can be mapped to thefollowing facial characteristics: eye size, eye spacing, nose length, nose width, mouthcurvature, mouth width, mouth openness, pupil size, eyebrow slant, eye eccentricity,and head eccentricity.

Chernoff faces make use of the ability of the human mind to recognize small dif-ferences in facial characteristics and to assimilate many facial characteristics at once.

Some visualization techniques

pixe-oriented techniquesgeometric projection techniquesIcon-based techniquesHierarchical techniquesGraph based techniques

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 11 / 12

Page 12: Data Mining - Sharif

Reading

Read chapter 2 of the following bookJ. Han, M. Kamber, and Jian Pei, Data Mining: Concepts and Techniques, MorganKaufmann, 2012.

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 12 / 12