Data Mining Algorithms for Large-Scale Distributed Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003
What is Data Mining?
The automatic analysis of large databases
The discovery of previously unknown patterns
The generation of a model of the data
Main Data Mining Problems
Association rules (description): he who does this and that will usually do some other thing too.
Classification (fraud, churn): these attributes indicate good behavior; those indicate bad behavior.
Clustering (analysis): there are three types of entities.
Examples – Classification
Customers purchase artifacts in a store
Each transaction is described in terms of a vector of features
The owner of the store tries to predict which transactions are fraudulent
  Example: young men who buy small electronics during rush hours
  Solution: do not accept checks
Examples – Associations
Amazon tracks user queries
  Suggests to each user additional books he would usually be interested in
Supermarket finds out “people who buy diapers also buy beer”
  Place diapers and beer at opposite sides of the supermarket
Examples – Clustering
Resource location
  Find the best location for k distribution centers
Feature selection
  Find 1000 concepts which summarize a whole dictionary
  Extract the meaning out of a document by replacing each word with the appropriate concept
  Car for auto, etc.
Why Mine Data of LSD Systems?
Data mining is good
It is otherwise difficult to monitor an LSD system: lots of data, spread across the system, impossible to collect
Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to just monitor a few nodes
An Example
Peers in the Kazaa network reveal to the system which files they have on their disks in exchange for access to the files of their peers
The result is a 2M-peer database of people's recreational preferences
Mining it, you could discover that Matrix fans are also keen on Radiohead songs
  Promote Radiohead performances in Matrix Reloaded
  Ask Radiohead to write the music for Matrix IV
What is so special about this problem?
Huge systems – huge amounts of data
Dynamic setting
  System – join / depart
  Data – constant update
Ad-hoc solution
Fast convergence
Our Work
We developed an association rule mining algorithm that works well in LSD systems
  Local and therefore scalable
  Asynchronous and therefore fast
  Dynamic and therefore robust
  Accurate – not approximated
  Anytime – you get early results fast
In a Teaspoon
A distributed data mining algorithm can be described as a series of distributed decisions
Those decisions are reduced to a majority vote
We developed a majority voting protocol which has all those good qualities
The outcome is an LSD association rule mining algorithm (still to come: classification)
Problem Definition – Association Rule Mining (ARM)
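$I = \{i_1, i_2, \ldots, i_m\}$
$X \subseteq I$
$T \subseteq I$
$DB = \{T_1, T_2, \ldots, T_k\}$
$\mathit{Support}(X, DB) = |\{T \in DB : X \subseteq T\}|$
$\mathit{Freq}(X, DB) = \dfrac{\mathit{Support}(X, DB)}{|DB|}$
$\mathit{Conf}(X \Rightarrow Y, DB) = \dfrac{\mathit{Freq}(X \cup Y, DB)}{\mathit{Freq}(X, DB)}$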
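As a minimal sketch (not part of the original slides), the three quantities defined on this slide, support, frequency, and confidence, can be computed over a small in-memory database as follows. The toy transactions and the function names are illustrative assumptions.

```python
from fractions import Fraction

# Toy database DB: each transaction T is a set of items (illustrative data).
DB = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

def support(x, db):
    """Support(X, DB): number of transactions that contain every item of X."""
    return sum(1 for t in db if x <= t)

def freq(x, db):
    """Freq(X, DB): support divided by the database size."""
    return Fraction(support(x, db), len(db))

def conf(x, y, db):
    """Conf(X => Y, DB); raises ZeroDivisionError if X never occurs."""
    return freq(x | y, db) / freq(x, db)

X = frozenset({"diapers"})
Y = frozenset({"beer"})
print(support(X, DB))   # 3
print(freq(X | Y, DB))  # 1/2
print(conf(X, Y, DB))   # 2/3
```

Exact fractions are used so that later threshold comparisons are not affected by floating-point rounding.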
Solution to Traditional ARM
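Let $0 \le \mathit{MinFreq} \le 1$, $0 \le \mathit{MinConf} \le 1$
$R[DB] = \{X \Rightarrow Y : X \cap Y = \emptyset,\ \mathit{Freq}(X \cup Y, DB) \ge \mathit{MinFreq},\ \mathit{Conf}(X \Rightarrow Y, DB) \ge \mathit{MinConf}\}$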
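To make the solution set concrete, here is a brute-force sketch that enumerates every rule X ⇒ Y whose frequency and confidence meet the two thresholds. It is for illustration only: it is exponential in the number of items and is not the algorithm presented in this talk. The function names, the toy data, and the restriction to non-empty, disjoint X and Y are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def freq(itemset, db):
    """Fraction of transactions in db that contain the whole itemset."""
    return Fraction(sum(1 for t in db if itemset <= t), len(db))

def rules(db, min_freq, min_conf):
    """Return all (X, Y) with Freq(X u Y) >= min_freq and Conf(X => Y) >= min_conf."""
    items = sorted(set().union(*db))
    found = set()
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            z = frozenset(combo)
            fz = freq(z, db)
            if fz < min_freq:
                continue  # X u Y is not frequent, so no rule can come from it
            for k in range(1, size):
                for left in combinations(sorted(z), k):
                    x = frozenset(left)
                    # Conf >= min_conf, written without division
                    if fz >= min_conf * freq(x, db):
                        found.add((x, z - x))
    return found

DB = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]
result = rules(DB, Fraction(1, 2), Fraction(3, 4))
print(result)  # only beer => diapers passes both thresholds
```

With MinFreq = 1/2 only {beer, diapers} is frequent; beer ⇒ diapers has confidence 1, while diapers ⇒ beer has confidence 2/3 and is rejected.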
Large-Scale Distributed ARM
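$DB^u_t$: the database of node $u$ at time $t$
$[u]_t = \{v \in V : v \text{ is reachable from } u \text{ at time } t\}$
$R[u]_t = R\left[\bigcup_{v \in [u]_t} DB^v_t\right]$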
Solution of LSD-ARM
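No termination
Anytime solution: $\tilde{R}[u]_t$, the set of rules $X \Rightarrow Y$ that node $u$ outputs at time $t$
Recall: $\dfrac{|R[u]_t \cap \tilde{R}[u]_t|}{|R[u]_t|}$
Precision: $\dfrac{|R[u]_t \cap \tilde{R}[u]_t|}{|\tilde{R}[u]_t|}$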
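A small sketch of how recall and precision of a node's ad-hoc rule set could be measured against the true rule set. Rules are represented as plain strings, and returning 1 for an empty denominator is a convention chosen here; both are assumptions rather than something the slides specify.

```python
from fractions import Fraction

def recall(true_rules, adhoc_rules):
    """Share of the true rules that the node has already discovered."""
    if not true_rules:
        return Fraction(1)  # convention assumed for an empty true set
    return Fraction(len(true_rules & adhoc_rules), len(true_rules))

def precision(true_rules, adhoc_rules):
    """Share of the node's current rules that are actually true."""
    if not adhoc_rules:
        return Fraction(1)  # convention assumed for an empty ad-hoc set
    return Fraction(len(true_rules & adhoc_rules), len(adhoc_rules))

true_rules = {"a=>b", "b=>c", "c=>d", "a=>d"}
adhoc_rules = {"a=>b", "b=>c", "d=>a"}  # one false rule, two rules still missing
print(recall(true_rules, adhoc_rules))     # 1/2
print(precision(true_rules, adhoc_rules))  # 2/3
```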
Majority Vote in LSD Systems
Unknown number of nodes vote 0 or 1
Nodes may dynamically change their vote
Edges are dynamically added / removed
An infrastructure
  detects failure
  ensures message integrity
  maintains a communication forest
Each node should decide if the global majority is of 0 or 1
Majority Vote in LSD Systems – cont.
Because of the dynamic settings, the algorithm never terminates
Instead we measure the percent of correct outputs
In static periods that percent ought to converge to 100%
In stationary periods we will show it converges to a different percentage
  Assume the overall percentage of ones remains the same, but they are constantly switched
LSD-Majority Algorithm
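Nodes communicate by exchanging messages $\langle s, c \rangle$
Node u maintains:
  $s^u$ – its vote, $c^u$ – one (for now)
  $\langle s^{uv}, c^{uv} \rangle$ – the last $\langle s, c \rangle$ it had sent to v
  $\langle s^{vu}, c^{vu} \rangle$ – the last $\langle s, c \rangle$ it had received from v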
LSD-Majority – cont.
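Node u calculates:
$\Delta^u = s^u + \sum_{v \in E^u} s^{vu} - \lambda \left( c^u + \sum_{v \in E^u} c^{vu} \right)$
  Captures the current knowledge of u
$\Delta^{uv} = s^{uv} + s^{vu} - \lambda \left( c^{uv} + c^{vu} \right)$
  Captures the current agreement between u and v
($E^u$: the neighbors of u; $\lambda$: the majority threshold, 1/2 for a simple majority)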
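A worked numeric sketch of the two quantities a node calculates: its current knowledge (its own vote plus everything received, minus the threshold times the matching counts) and its current agreement with one neighbor (what was last sent to and received from that neighbor). The threshold of 1/2, the function names, and the message values are assumptions chosen for illustration.

```python
from fractions import Fraction

LAMBDA = Fraction(1, 2)  # assumed majority threshold

def delta_u(s_u, c_u, received):
    """Knowledge of u: own <s, c> plus the last <s, c> received from each neighbor."""
    s = s_u + sum(s_vu for s_vu, _ in received.values())
    c = c_u + sum(c_vu for _, c_vu in received.values())
    return s - LAMBDA * c

def delta_uv(sent_to_v, received_from_v):
    """Agreement of u and v: last message sent to v plus last received from v."""
    (s_uv, c_uv), (s_vu, c_vu) = sent_to_v, received_from_v
    return (s_uv + s_vu) - LAMBDA * (c_uv + c_vu)

# Node u votes 1 and has two neighbors, v and w.
received = {"v": (2, 3), "w": (0, 1)}  # last messages received by u
sent_to_v = (1, 1)                     # sent to v before w's message arrived

du = delta_u(1, 1, received)               # (1 + 2 + 0) - 1/2 * (1 + 3 + 1)
duv = delta_uv(sent_to_v, received["v"])   # (1 + 2) - 1/2 * (1 + 3)
print(du, duv)  # 1/2 1
```

Here the agreement with v (1) is more extreme than what u now knows (1/2), so v may be overestimating u's support for the positive outcome.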
LSD-Majority – Rationale
It is OK if the current knowledge of u is more extreme than what it had agreed with v
The opposite is not OK
  v might assume u supports its decision more strongly than u actually does
Tie breaking prefers a negative decision
LSD-Majority – The Protocol
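If $c^{uv} + c^{vu} = 0$ and $\Delta^u \ge 0$,
or $c^{uv} + c^{vu} > 0$ and either $\Delta^{uv} < 0$ and $\Delta^u > \Delta^{uv}$, or $\Delta^{uv} \ge 0$ and $\Delta^u < \Delta^{uv}$,
then set $s^{uv} = s^u + \sum_{w \ne v \in E^u} s^{wu}$ and $c^{uv} = c^u + \sum_{w \ne v \in E^u} c^{wu}$, and send $\langle s^{uv}, c^{uv} \rangle$ to v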
LSD-Majority – The Protocol
The same decision is applied whenever
  a message is received
  $s^u$ changes
  an edge fails or recovers
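The protocol can be exercised with a small single-process simulation. This is a sketch under stated assumptions, not the authors' implementation: the network is a static tree, messages are delivered in FIFO order from one queue, the threshold is 1/2, a node outputs 1 when its knowledge value is at least 0 (the examples avoid global ties), and all class and function names are hypothetical.

```python
from collections import deque
from fractions import Fraction

LAMBDA = Fraction(1, 2)  # assumed majority threshold

class Node:
    def __init__(self, vote):
        self.s, self.c = vote, 1  # own vote and count
        self.sent = {}            # neighbor -> last <s, c> sent to it
        self.recv = {}            # neighbor -> last <s, c> received from it

    def delta_u(self):
        s = self.s + sum(sv for sv, _ in self.recv.values())
        c = self.c + sum(cv for _, cv in self.recv.values())
        return s - LAMBDA * c

    def delta_uv(self, v):
        (s_out, c_out), (s_in, c_in) = self.sent[v], self.recv[v]
        return (s_out + s_in) - LAMBDA * (c_out + c_in)

    def must_send(self, v):
        du, duv = self.delta_u(), self.delta_uv(v)
        if self.sent[v][1] + self.recv[v][1] == 0:  # nothing exchanged yet
            return du >= 0
        return (duv < 0 and du > duv) or (duv >= 0 and du < duv)

    def outgoing(self, v):
        """Everything u knows except what v itself reported."""
        s = self.s + sum(sw for w, (sw, _) in self.recv.items() if w != v)
        c = self.c + sum(cw for w, (_, cw) in self.recv.items() if w != v)
        return (s, c)

def run(votes, edges):
    """Simulate until no messages are in flight; return (decisions, message count)."""
    nodes = {name: Node(vote) for name, vote in votes.items()}
    for a, b in edges:
        for x, y in ((a, b), (b, a)):
            nodes[x].sent[y] = (0, 0)
            nodes[x].recv[y] = (0, 0)
    queue = deque()  # in-flight messages: (sender, receiver, s, c)

    def react(u):
        for v in nodes[u].sent:
            if nodes[u].must_send(v):
                msg = nodes[u].outgoing(v)
                nodes[u].sent[v] = msg
                queue.append((u, v, *msg))

    for u in nodes:
        react(u)
    delivered = 0
    while queue:
        u, v, s, c = queue.popleft()
        nodes[v].recv[u] = (s, c)
        delivered += 1
        react(v)
    decisions = {u: 1 if n.delta_u() >= 0 else 0 for u, n in nodes.items()}
    return decisions, delivered

chain = [("a", "b"), ("b", "c")]
minority, _ = run({"a": 1, "b": 0, "c": 0}, chain)
majority, _ = run({"a": 1, "b": 0, "c": 1}, chain)
silent, silent_msgs = run({"a": 0, "b": 0, "c": 0}, chain)
print(minority)  # {'a': 0, 'b': 0, 'c': 0}
print(majority)  # {'a': 1, 'b': 1, 'c': 1}
```

When every node votes 0, no node has a reason to speak and the run finishes without a single message, which illustrates the locality claimed for the algorithm.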
LSD-Majority – Example
LSD-Majority Results
Proof of Correctness
Will be given in class
Back from Majority to ARM
To decide whether an itemset X is frequent or not
  set $s^u = \mathit{Support}(X, DB^u_t)$
  set $c^u = |DB^u_t|$
  set $\lambda = \mathit{MinFreq}$
  run LSDM
Back from Majority to ARM
To decide whether a rule X ⇒ Y is confident or not
  set $s^u = \mathit{Support}(X \cup Y, DB^u_t)$
  set $c^u = \mathit{Support}(X, DB^u_t)$
  set $\lambda = \mathit{MinConf}$
  run LSDM
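The two reductions can be checked on a toy partitioned database. The sketch below does not run the distributed protocol; it only sums the per-node inputs centrally to show that the sign of the total matches the frequency and confidence tests. The two-node partition, the thresholds, and the function names are illustrative assumptions.

```python
from fractions import Fraction

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def frequency_vote(x, local_db):
    """Per-node <s, c> for 'is X frequent?': local support and local size."""
    return support(x, local_db), len(local_db)

def confidence_vote(x, y, local_db):
    """Per-node <s, c> for 'is X => Y confident?': support of X u Y and of X."""
    return support(x | y, local_db), support(x, local_db)

def global_delta(votes, lam):
    """Sum of s minus lambda times sum of c over all nodes."""
    return sum(s for s, _ in votes) - lam * sum(c for _, c in votes)

# The database is split across two nodes (illustrative data).
partitions = [
    [frozenset({"diapers", "beer", "milk"}), frozenset({"diapers", "beer"})],
    [frozenset({"diapers", "bread"}), frozenset({"milk", "bread"})],
]
X, Y = frozenset({"diapers"}), frozenset({"beer"})

freq_delta = global_delta([frequency_vote(X | Y, p) for p in partitions],
                          Fraction(1, 2))   # MinFreq = 1/2
conf_delta = global_delta([confidence_vote(X, Y, p) for p in partitions],
                          Fraction(3, 4))   # MinConf = 3/4
print(freq_delta >= 0)  # True: Freq(X u Y) = 1/2 meets MinFreq
print(conf_delta >= 0)  # False: Conf(X => Y) = 2/3 is below MinConf
```

No node needs to know the global frequency or confidence value itself; only the sign of the aggregate matters.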
Additionally
Create candidates based on the ad-hoc solution
Create rules on-the-fly rather than upon termination
Our algorithm outputs the correct rules without specifying their global frequency and confidence
Eventual Results
By the time the database is scanned once, in parallel, the average node has discovered 95% of the rules, and has less than 10% false rules.