CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR...

CS5614:(Big)DataManagementSystems

B.AdityaPrakashLecture#16:Clustering

HighDimensionalData

§  Givenacloudofdatapointswewanttounderstanditsstructure

VTCS5614 2Prakash2017

TheProblemofClustering§  Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters,sothat– Membersofaclusterareclose/similartoeachother– Membersofdifferentclustersaredissimilar

§  Usually:– Pointsareinahigh-dimensionalspace– Similarityisdefinedusingadistancemeasure

•  Euclidean,Cosine,Jaccard,editdistance,…

VTCS5614Prakash2017

Example:Clusters&Outliers

x x x x x x x x x x x x x

x xx x x x

x x x x

x x x x x x x x x

VTCS5614

x xx x x x

x x x x

x x x x x x x x x

x Outlier Cluster

Prakash2017

Clusteringisahardproblem!

Whyisithard?

§  Clusteringintwodimensionslookseasy§  Clusteringsmallamountsofdatalookseasy§  Andinmostcases,looksarenotdeceiving

§  ManyapplicaDonsinvolvenot2,but10or10,000dimensions

§  High-dimensionalspaceslookdifferent:Almostallpairsofpointsareataboutthesamedistance

VTCS5614Prakash2017

ClusteringProblem:Galaxies§  Acatalogof2billion“skyobjects”representsobjectsby

theirradiaWonin7dimensions(frequencybands)§  Problem:Clusterintosimilarobjects,e.g.,galaxies,nearby

stars,quasars,etc.§  SloanDigitalSkySurvey

ClusteringProblem:MusicCDs§  IntuiWvely:Musicdividesintocategories,andcustomerspreferafewcategories– Butwhatarecategoriesreally?

§  RepresentaCDbyasetofcustomerswhoboughtit:

§  SimilarCDshavesimilarsetsofcustomers,andvice-versa

8VTCS5614Prakash2017

ClusteringProblem:MusicCDsSpaceofallCDs:§  Thinkofaspacewithonedim.foreachcustomer– Valuesinadimensionmaybe0or1only– ACDisapointinthisspace(x1,x2,…,xk),wherexi=1ifftheithcustomerboughttheCD

§  ForAmazon,thedimensionistensofmillions

§  Task:FindclustersofsimilarCDsVTCS5614 9Prakash2017

ClusteringProblem:Documents

Findingtopics:§  Representadocumentbyavector(x1,x2,…,xk),wherexi=1ifftheithword(insomeorder)appearsinthedocument–  Itactuallydoesn’tmaberifkisinfinite;i.e.,wedon’tlimitthesetofwords

§  Documentswithsimilarsetsofwordsmaybeaboutthesametopic

Cosine,Jaccard,andEuclidean§  AswithCDswehaveachoicewhenwethinkofdocumentsassetsofwordsorshingles:– Setsasvectors:Measuresimilaritybythecosinedistance

– Setsassets:MeasuresimilaritybytheJaccarddistance

– Setsaspoints:MeasuresimilaritybyEuclideandistance

Overview:MethodsofClustering

§  Hierarchical:– AgglomeraWve(bobomup):

•  IniDally,eachpointisacluster•  Repeatedlycombinethetwo“nearest”clustersintoone

– Divisive(topdown):•  Startwithoneclusterandrecursivelysplitit

§  Pointassignment:– Maintainasetofclusters–  Pointsbelongto“nearest”cluster

VTCS5614Prakash2017

HierarchicalClustering

§  KeyoperaWon:Repeatedlycombinetwonearestclusters

§  ThreeimportantquesWons:– 1)Howdoyourepresentaclusterofmorethanonepoint?

– 2)Howdoyoudeterminethe“nearness”ofclusters?

– 3)Whentostopcombiningclusters?

HierarchicalClustering§  KeyoperaWon:Repeatedlycombinetwonearestclusters

§  (1)Howtorepresentaclusterofmanypoints?–  Keyproblem:Asyoumergeclusters,howdoyourepresentthe“locaDon”ofeachcluster,totellwhichpairofclustersisclosest?

§  Euclideancase:eachclusterhasacentroid=averageofits(data)points

§  (2)Howtodetermine“nearness”ofclusters?– Measureclusterdistancesbydistancesofcentroids

Example:Hierarchicalclustering

Prakash2017 VTCS5614 15

AndintheNon-EuclideanCase?WhatabouttheNon-Euclideancase?§  Theonly“locaDons”wecantalkaboutarethepointsthemselves–  i.e.,thereisno“average”oftwopoints

§  Approach1:–  (1)Howtorepresentaclusterofmanypoints?clustroid=(data)point“closest”tootherpoints

–  (2)Howdoyoudeterminethe“nearness”ofclusters?Treatclustroidasifitwerecentroid,whencompuDnginter-clusterdistances

“Closest”Point?§  (1)Howtorepresentaclusterofmanypoints?clustroid=point“closest”tootherpoints

§  Possiblemeaningsof“closest”:–  Smallestmaximumdistancetootherpoints–  Smallestaveragedistancetootherpoints–  Smallestsumofsquaresofdistancestootherpoints

•  FordistancemetricdclustroidcofclusterCis:

VTCS5614 17

∑∈Cxc

cxd 2),(min

Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all other points in the cluster.

Cluster on 3 datapoints

Centroid

Clustroid

Datapoint

Prakash2017

Defining“Nearness”ofClusters

§  (2)Howdoyoudeterminethe“nearness”ofclusters?– Approach2:Interclusterdistance=minimumofthedistancesbetweenanytwopoints,onefromeachcluster

– Approach3:PickanoDonof“cohesion”ofclusters,e.g.,maximumdistancefromtheclustroid

•  Mergeclusterswhoseunionismostcohesive

Cohesion

§  Approach3.1:Usethediameterofthemergedcluster=maximumdistancebetweenpointsinthecluster

§  Approach3.2:Usetheaveragedistancebetweenpointsinthecluster

§  Approach3.3:Useadensity-basedapproach– Takethediameteroravg.distance,e.g.,anddividebythenumberofpointsinthecluster

ImplementaWon

§  NaïveimplementaWonofhierarchicalclustering:– Ateachstep,computepairwisedistancesbetweenallpairsofclusters,thenmerge

– O(N3)

§  CarefulimplementaDonusingpriorityqueuecanreduceDmetoO(N2logN)– SWlltooexpensiveforreallybigdatasetsthatdonotfitinmemory

K-MEANSCLUSTERING

k–meansAlgorithm(s)

§  AssumesEuclideanspace/distance

§  Startbypickingk,thenumberofclusters

§  IniDalizeclustersbypickingonepointpercluster– Example:Pickonepointatrandom,thenk-1otherpoints,eachasfarawayaspossiblefromthepreviouspoints

PopulaWngClusters§  1)Foreachpoint,placeitintheclusterwhosecurrentcentroiditisnearest

§  2)Alerallpointsareassigned,updatethelocaDonsofcentroidsofthekclusters

§  3)Reassignallpointstotheirclosestcentroid–  SomeDmesmovespointsbetweenclusters

§  Repeat2and3unWlconvergence–  Convergence:Pointsdon’tmovebetweenclustersandcentroidsstabilize

Example:AssigningClusters

VTCS5614 24

x … data point … centroid

Clusters after round 1 Prakash2017

VTCS5614 25

Clusters after round 2 Prakash2017

VTCS5614 26

Clusters at the end Prakash2017

Gegngthekright

Howtoselectk?§  Trydifferentk,lookingatthechangeintheaveragedistancetocentroidaskincreases

§  AveragefallsrapidlyunDlrightk,thenchangeslible

Average distance to

centroid

Best value of k

VTCS5614Prakash2017

Example:Pickingk

VTCS5614 28

x xx x x x

x x x x

x x x x x x x x x

Too few; many long distances to centroid.

Prakash2017

Example:Pickingk

VTCS5614 29

x xx x x x

x x x x

x x x x x x x x x

Just right; distances rather short.

Prakash2017

Example:Pickingk

VTCS5614 30

x xx x x x

x x x x

x x x x x x x x x

Too many; little improvement in average distance.

Prakash2017

THEBFRALGORITHM

Extensionofk-meanstolargedata

BFRAlgorithm§  BFR[Bradley-Fayyad-Reina]isavariantofk-meansdesignedtohandleverylarge(disk-resident)datasets

§  AssumesthatclustersarenormallydistributedaroundacentroidinaEuclideanspace–  StandarddeviaDonsindifferentdimensionsmayvary

•  Clustersareaxis-alignedellipses

§  Efficientwaytosummarizeclusters(wantmemoryrequiredO(clusters)andnotO(data))

BFRAlgorithm§  Pointsarereadfromdiskonemain-memory-fullataDme

§  MostpointsfrompreviousmemoryloadsaresummarizedbysimplestaWsWcs

§  Tobegin,fromtheiniDalloadweselecttheiniDalkcentroidsbysomesensibleapproach:–  Takekrandompoints–  TakeasmallrandomsampleandclusteropDmally–  Takeasample;pickarandompoint,andthenk–1morepoints,eachasfarfromthepreviouslyselectedpointsaspossible

ThreeClassesofPoints3setsofpointswhichwekeeptrackof:§  Discardset(DS):

–  Pointscloseenoughtoacentroidtobesummarized

§  Compressionset(CS):–  GroupsofpointsthatareclosetogetherbutnotclosetoanyexisDngcentroid

–  Thesepointsaresummarized,butnotassignedtoacluster

§  Retainedset(RS):–  IsolatedpointswaiDngtobeassignedtoacompressionset

BFR:“Galaxies”Picture

VTCS5614 35

A cluster. Its points are in the DS.

The centroid

Compressed sets. Their points are in the CS.

Points in the RS

Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

SummarizingSetsofPointsForeachcluster,thediscardset(DS)issummarizedby:§  Thenumberofpoints,N§  ThevectorSUM,whoseithcomponentisthesumofthecoordinatesofthepointsintheithdimension

§  ThevectorSUMSQ:ithcomponent=sumofsquaresofcoordinatesinithdimension

VTCS5614

36A cluster. All its points are in the DS. The centroid

Prakash2017

SummarizingPoints:Comments

§  2d+1valuesrepresentanysizecluster–  d=numberofdimensions

§  Averageineachdimension(thecentroid)canbecalculatedasSUMi/N–  SUMi=ithcomponentofSUM

§  Varianceofacluster’sdiscardsetindimensioniis:(SUMSQi/N)–(SUMi/N)2–  AndstandarddeviaDonisthesquarerootofthat

§  Nextstep:Actualclustering

37VTCS5614

Note: Dropping the “axis-aligned” clusters assumption would require storing full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big! Prakash2017

The“Memory-Load”ofPoints

Processingthe“Memory-Load”ofpoints(1):§  1)Findthosepointsthatare“sufficientlyclose”toaclustercentroidandaddthosepointstothatclusterandtheDS–  Thesepointsaresoclosetothecentroidthattheycanbesummarizedandthendiscarded

§  2)Useanymain-memoryclusteringalgorithmtoclustertheremainingpointsandtheoldRS–  ClustersgototheCS;outlyingpointstotheRS

VTCS5614 38

Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

The“Memory-Load”ofPointsProcessingthe“Memory-Load”ofpoints(2):§  3)DSset:AdjuststaDsDcsoftheclusterstoaccountforthenewpoints–  AddNs,SUMs,SUMSQs

§  4)ConsidermergingcompressedsetsintheCS

§  5)Ifthisisthelastround,mergeallcompressedsetsintheCSandallRSpointsintotheirnearestcluster

VTCS5614 39

Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

BFR:“Galaxies”Picture

VTCS5614 40

A cluster. Its points are in the DS.

The centroid

Compressed sets. Their points are in the CS.

Points in the RS

Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

AFewDetails…

§  Q1)Howdowedecideifapointis“closeenough”toaclusterthatwewilladdthepointtothatcluster?

§  Q2)Howdowedecidewhethertwocompressedsets(CS)deservetobecombinedintoone?

HowCloseisCloseEnough?

§  Q1)Weneedawaytodecidewhethertoputanewpointintoacluster(anddiscard)

§  BFRsuggeststwoways:–  TheMahalanobisdistanceislessthanathreshold–  Highlikelihoodofthepointbelongingtocurrentlynearestcentroid

MahalanobisDistance

§  Ifclustersarenormallydistributedinddimensions,thenalertransformaDon,onestandarddeviaDon=√𝒅 –  i.e.,68%ofthepointsoftheclusterwillhaveaMahalanobisdistance<√𝒅 

§  AcceptapointforaclusterifitsM.D.is<somethreshold,e.g.2standarddeviaDons

Picture:EqualM.D.Regions§  Euclideanvs.Mahalanobisdistance

VTCS5614 45

Contours of equidistant points from the origin

Uniformly distributed points, Euclidean distance

Normally distributed points, Euclidean distance

Normally distributed points, Mahalanobis distance

Prakash2017

Should2CSclustersbecombined?

Q2)Should2CSsubclustersbecombined?§  Computethevarianceofthecombinedsubcluster

–  N,SUM,andSUMSQallowustomakethatcalculaDonquickly

§  Combineifthecombinedvarianceisbelowsomethreshold

§  ManyalternaWves:Treatdimensionsdifferently,considerdensity

THECUREALGORITHM

Extensionofk-meanstoclustersofarbitraryshapes

TheCUREAlgorithm§  ProblemwithBFR/k-means:

–  Assumesclustersarenormallydistributedineachdimension

–  Andaxesarefixed–ellipsesatananglearenotOK

§  CURE(ClusteringUsingREpresentaWves):–  AssumesaEuclideandistance–  Allowsclusterstoassumeanyshape–  UsesacollecWonofrepresentaWvepointstorepresentclusters

VTCS5614Prakash2017

Example:UniversitySalaries

salary

VTCS5614Prakash2017

StarWngCURE2Passalgorithm.Pass1:§  0)Pickarandomsampleofpointsthatfitinmainmemory

§  1)IniWalclusters:–  Clusterthesepointshierarchically–groupnearestpoints/clusters

§  2)PickrepresentaWvepoints:–  Foreachcluster,pickasampleofpoints,asdispersedaspossible

–  Fromthesample,pickrepresentaDvesbymovingthem(say)20%towardthecentroidofthecluster

Example:IniWalClusters

salary

VTCS5614Prakash2017

Example:PickDispersedPoints

salary

Pick (say) 4 remote points for each cluster.

VTCS5614Prakash2017

Example:PickDispersedPoints

salary

Move points (say) 20% toward the centroid.

VTCS5614Prakash2017

FinishingCURE

Pass2:§  Now,rescanthewholedatasetandvisiteachpointpinthedataset

§  Placeitinthe“closestcluster”– NormaldefiniDonof“closest”:FindtheclosestrepresentaDvetopandassignittorepresentaDve’scluster

54VTCS5614

Prakash2017

Summary§  Clustering:Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters

§  Algorithms:– AgglomeraDvehierarchicalclustering:

•  Centroidandclustroid– k-means:

•  IniDalizaDon,pickingk– BFR– CURE

CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR...

Documents

CREATE TABLE Students - Undergraduate Courses ...courses.cs.vt.edu/~cs5614/lectures/module2.pdfCS 5614: Basic Data Definition and Modification in SQL 71 Modifying Relation Schemas

Personalizing Interactions with Information Systemspeople.cs.vt.edu/~ramakris/papers/piis.pdfPersonalization entails customizing information access, structure, and presentation to

SPR17 ECD Theology 1 · Web viewKurios (Kuvrio") This name is derived from kuros, “power” and is generally translated “Lord.” Kurios designates God as the mighty One and

Analyzing Payment-driven Targeted Q&A Systemspeople.cs.vt.edu/tekang/papers/TSC_2018.pdf · Analyzing Payment-driven Targeted Q&A Systems STEVE T.K. JAN, Virginia Tech, USA CHUN WANG,

CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/lecture-16.pdf · CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #16:

INVITATION TO C OMMENTINVITATION TO C OMMENT SPR17-Title Juvenile Law: Psychotropic Medication Proposed Rules, Forms, Standards, or Statutes Amend Cal. Rules of Court, rule 5.640;

AN INTRODUCTION TO ERROR CORRECTING CODES …circuit.ucsd.edu/~yhk/ece154c-spr17/pdfs/ErrorCorrectionI.pdf · NATURE’S ERROR CONTROL ... RNA-Amino Acid Coding AUG starts codon

Homework 4: Clustering, Recommenders, Dim. Reduction, ML ...people.cs.vt.edu/.../homeworks/hw4/HW4.pdf · Homework 4: Clustering, Recommenders, Dim. Reduction, ML ... badityap/classes/cs5614-Fall14/homeworks/hw4

CS5614:(Big)Data# Management#Systems#people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-5.pdf · CS5614:(Big)Data# Management#Systems# B.#Aditya#Prakash# Lecture’#5:’Hashing’and’Sor5ng’

*PQ spr17 38-73.qxp Layout 1 2/8/17 2:05 PM Page 54 › wp-con… · prints decorate the walls, the chairs testify to many hours spent resting by the 43 working fireplaces, and a

CS313D: ADVANCED PROGRAMMING LANGUAGE · 2/7/2017 1 CS313D: ADVANCED PROGRAMMING LANGUAGE Lecture 3: C# language basics II Lecture Contents Dr. Amal Khalifa, Spr17 C# basics Methods

CalPoly OrfaleaMagGuts SPR17 R8€¦ · of Marketing in the spring and fall of 2016 to test the impact of peer-mentorship on the learning process. The program matched seniors concentrating

Lecture 1: Mobility Management in Mobile Wireless Systemspeople.cs.vt.edu/~irchen/6204/paper/lecture1-mobility-management.… · Lecture 1: Mobility Management in Mobile Wireless

Peer-to-Peer Systems and Distributed Hash Tables › courses › archive › spr17 › cos... · 2017-04-16 · 1.Peer-to-Peer Systems – Napster, Gnutella, BitTorrent, challenges

ARTH sec1A SPR17 · social, political, and economic life of its culture, ... Mesopotamia Egypt Greece Rome } PREHISTORIC } In prehistoric art, the term pictograph or pictogram describes

Pertumbuhan Kultur Tunggal dan Campur Bakteri SPR3 dan ...repository.uksw.edu/bitstream/123456789/8523/2/T1_412020005_Full text.pdf · Tunggal dan Campur Bakteri SpR3 dan SpR17 dan

FACULTY FOCUS CLASS NOTES S NATIONAL SECURITY ...content.law.virginia.edu/system/files/news/spr17/faculty-news...UVA LAWYER SPRING 2017 SPRING 2017 UVA LAWYER . law. FACULTY . by a

Ch1 PP SetTheory - math.drexel.edujsteuber/spr17/PP/Ch1_PP_SetTheory.… · 1. Consider the matrices A= 0−2 −68 −75 ⎡ ⎣ ⎢ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ ⎥, ⎥ ⎦ ⎤ ⎢

INVITATION TO C OMMENT - California Courts · 2019-05-08 · INVITATION TO C OMMENT SPR17-22 Title Protective Orders: Modification and Termination Proposed Rules, Forms, Standards,

FOR MEETING OF: July 12, 2017 CASE NO.: CU-SPR17-08 …€¦ · · 2017-07-05(OLCC), and complying with ... production process described in the applicant's operating plan does not