Knowledge Discovery and Data Mining (COMP 5318) S1, 2013


Knowledge Discovery and Data Mining (COMP 5318)

S1, 2013

The Lecturing Team

Coordinator Sanjay Chawla, SIT

Lecturer Sanjay Chawla and Wei Liu (NICTA)

Tutors Didi Surian, Linsey Pang and Fei Wang (PhD Students)

Material and Lectures

● Lectures will be posted on http://www.it.usyd.edu.au/~comp5318

● We will mainly follow the textbook by Rajaraman, Leskovec and Ullman from Stanford, which is available online

● However, the ordering of topics will be different

Assessment Package

In-Class Test (15%) Week 6

Group Assignment (20%) Week 10

Research Paper Presentation (15%) Weeks 11-12

Final Exam (50%) See Exam Calendar

To pass the class you must get at least 50% in the final exam and at least 40% (combined) in the other assessments.

So what is this class about....

1. Building large knowledge discovery pipelines to solve real-world problems (a.k.a. "Data Analytic" pipelines)
2. Learning about techniques to analyze and algorithms to mine data
3. Learn how to read original research in data mining and machine learning
4. Learn how to solve large data problems in the cloud

Coordinates of Data Mining

[Diagram: Data Mining sits at the intersection of Algorithms and Statistics, Machine Learning, Linear Algebra, Distributed & Parallel Computing, Database Management Systems, Information Retrieval, and Computer Vision.]

Abstract Tasks in Data Mining

● Clustering and Segmentation: how to automatically group objects into clusters
  ○ Take photographs from Flickr and automatically create categories
● Classification and Regression: how to make statistical models for prediction
  ○ Predict whether an online user will click on a banner advertisement
  ○ Predict the currency exchange rate (AUD/USD) tomorrow
  ○ Predict who will win the NBA championship in 2013

Abstract Tasks in Data Mining....cont

● Anomaly Detection: identify entities which are different from the rest of the group
  ○ Which galaxy is different in an astronomical database?
  ○ Which area has an unusual flu rate?
  ○ Is this credit card transaction fraudulent?
  ○ Identify cyber attacks: Denial of Service (DoS) and port scans
  ○ Identify genes which are likely to cause a certain disease

Knowledge Discovery Pipeline

[Diagram: multiple Data Sources feed into Data Integration, then a Data Mining Task, then Presentation of Results.]

Example: Large Scale Advertising Systems

[Diagram: a large-scale online advertising exchange. Components: Publisher (Web Page), Ad Exchange, Advertisers 1..n, Demand Side Platform, Ad Server. Messages: 1) <user>, 2) <uid, url>, 3) <ad, bid>, 4) <winning ad>, 5) <adid>, 6) <ad_creative>, 7) <user>.]

Let's do something tangible...

Underlying all data mining tasks is the notion of similarity...

1. When are two images similar?
2. When are two documents similar?
3. When are two patients similar?
4. When are two shopping baskets similar?
5. When are two job candidates similar?
6. When are two galaxies similar?
7. When is network traffic similar?

Data Vector

In Data Mining, data is often transformed into a vector of numbers. For example,

D1: Computer science and physics have a lot in common. In the former, we build models of computation and in the latter, models of the physical world.

a    the    brain    latter    of    world    cheese    in
0     1       0        1        1      1        0        3

What is the length of this vector?
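As a concrete illustration (not part of the original slides), here is a minimal Python sketch of building such a count vector for D1 over a small fixed vocabulary. The tokenization is deliberately crude, so the exact counts may differ slightly from the table above.

```python
# Minimal sketch: turn a document into a count vector over a fixed vocabulary.
vocabulary = ["a", "the", "brain", "latter", "of", "world", "cheese", "in"]

d1 = ("Computer science and physics have a lot in common. In the former, "
      "we build models of computation and in the latter, models of the "
      "physical world.")

# Crude tokenization: lowercase, strip basic punctuation, split on whitespace.
tokens = d1.lower().replace(".", " ").replace(",", " ").split()

# One count per vocabulary word; the vector's length equals the vocabulary size (8).
vector = [tokens.count(word) for word in vocabulary]
print(vector)
```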

Data Vector....cont

[Image: a 700 x 500 photograph (www.sydney.visitorsbureau.com.au). A 3 x 3 patch of pixel intensities

4 45 6
6 12 33
22 17 44

is flattened row by row into the data vector <4, 45, 6, 6, 12, 33, 22, 17, 44>.]
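The same idea for images, as a small sketch using NumPy (my choice of library, not something the slides prescribe): the 3 x 3 patch above is flattened row by row into a single data vector.

```python
import numpy as np

# The 3 x 3 patch of pixel intensities from the slide.
patch = np.array([[ 4, 45,  6],
                  [ 6, 12, 33],
                  [22, 17, 44]])

# Flattening row by row yields the data vector shown above.
vector = patch.flatten()
print(vector)  # [ 4 45  6  6 12 33 22 17 44]

# A full 700 x 500 greyscale image would flatten to a vector of length 350,000.
```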

Data Vector...cont

[Plot: a time series, e.g. a daily currency exchange rate, read as the data vector <1.022, 1.01, 1.002, 1.01, 1.01, 1.01, 1.02, 1.03, 1.03, 1.03>.]

Similarity

● Once we have data vectors, we can start the computation process... for example, when are two data vectors similar?

Which pair of currency trades is more similar?

Similarity Computation

Suppose we want to compute the similarity between two vectors:
x = <3,4,1,2>; y = <1,2,3,1>

Step 1: compute the length of each vector:
||x|| = (3^2 + 4^2 + 1^2 + 2^2)^(1/2) = (9 + 16 + 1 + 4)^(1/2) = 5.48
||y|| = (1^2 + 2^2 + 3^2 + 1^2)^(1/2) = (1 + 4 + 9 + 1)^(1/2) = 3.87

Step 2: compute the dot product:
x.y = 3×1 + 4×2 + 1×3 + 2×1 = 3 + 8 + 3 + 2 = 16

Step 3: x.y/(||x|| ||y||) = 16/((5.48)(3.87)) = 0.75
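The same three steps, written as a short Python sketch (illustrative only; the variable names are mine):

```python
import math

x = [3, 4, 1, 2]
y = [1, 2, 3, 1]

# Step 1: the length (Euclidean norm) of each vector.
norm_x = math.sqrt(sum(xi * xi for xi in x))  # sqrt(30) ≈ 5.48
norm_y = math.sqrt(sum(yi * yi for yi in y))  # sqrt(15) ≈ 3.87

# Step 2: the dot product.
dot = sum(xi * yi for xi, yi in zip(x, y))    # 16

# Step 3: cosine similarity.
sim = dot / (norm_x * norm_y)
print(round(sim, 2))                          # 0.75
```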

Cosine Similarity

1. Thus the similarity sim(x,y) between two data vectors x and y is given by x.y/(||x||.||y||)

2. This is called cosine similarity (Why?)

3. This is a very general concept and underpins much of data-driven computation

4. We will be coming back to it... over and over again

More examples

x = <1,0,1,0>; y = <0,1,0,1>
sim(x,y) = 0

x = <1,3,2,1>; y = <1,3,2,1>
sim(x,y) = 1

If all elements of the data vectors are non-negative, then 0 <= sim(x,y) <= 1
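Wrapping the computation in a small helper (the function name is mine, not part of any course code) reproduces these boundary cases:

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length data vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 1, 0], [0, 1, 0, 1]))  # 0.0: no shared non-zero components
print(cosine_similarity([1, 3, 2, 1], [1, 3, 2, 1]))  # ~1.0: identical vectors
```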

Cost of Computation

x = <3,4,1,2>; y = <1,2,3,1>
Step 1: compute the length of each vector:
||x|| = (3^2 + 4^2 + 1^2 + 2^2)^(1/2) = (9 + 16 + 1 + 4)^(1/2) = 5.48 [4 mult; 3 adds; 1 sqrt]
||y|| = (1^2 + 2^2 + 3^2 + 1^2)^(1/2) = (1 + 4 + 9 + 1)^(1/2) = 3.87 [4 mult; 3 adds; 1 sqrt]
Step 2: compute the dot product:
x.y = 3×1 + 4×2 + 1×3 + 2×1 = 3 + 8 + 3 + 2 = 16 [4 mult; 3 adds]
Step 3: x.y/(||x|| ||y||) = 16/((5.48)(3.87)) = 0.75 [1 mult; 1 divide]
Total FLOPs (assuming 1 FLOP per operation) = 2d [||x||] + 2d [||y||] + (2d - 1) [x.y] + 2 [Step 3] = 6d + 1

Cost of Computation

The cost of computing the similarity between two vectors of length d is 6d + 1, or ~6d.

Suppose we want to find the similarity between all pairs of Wikipedia documents.
http://en.wikipedia.org/wiki/Wikipedia:Statistics
Number of articles: ~4,000,000
Length of each data vector: ~100,000 [# of words in the dictionary]

Number of pairwise combinations: ~8 x 10^12

Number of FLOPs: ~8 x 10^12 x 6 x 10^5 = 48 x 10^17 ~ 10^18
World's fastest computer (2012), Titan at Oak Ridge Labs: 27 petaFLOPS (27 thousand trillion FLOPs per second; ~10^16) [~100 seconds]

Current desktop: 3 GHz, ~10^9 FLOPs per second. Thus ~10^9 seconds ~ 33 years.
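As a rough back-of-envelope check of these figures (a sketch only; the article count, dictionary size and machine rates are the slide's rounded estimates, not measurements):

```python
# Order-of-magnitude estimate only, following the slide's rounded figures.
pairs = 8e12             # pairwise combinations of ~4,000,000 articles
d = 1e5                  # length of each data vector (dictionary size)
flops = 6 * d * pairs    # = 4.8e18, which the slide rounds down to ~1e18
flops_rounded = 1e18     # the slide's rounded total

titan_rate = 2.7e16      # Titan (2012): ~27 petaFLOPS
desktop_rate = 1e9       # rough figure for a 3 GHz desktop
seconds_per_year = 3.15e7

print(flops_rounded / titan_rate)                       # ~37 s: seconds-to-minutes on Titan
print(flops_rounded / desktop_rate / seconds_per_year)  # ~32 years, i.e. the slide's ~33 yrs
```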

Summary

We use data mining to build knowledge discovery pipelines.

Data Mining is the process of applying algorithms to data.

A key concept is that of defining similarity between entities.

Recommended