Data Mining: Methods and Algorithms
Romi Satria Wahono
[email protected]
http://romisatriawahono.net
+6281586220090


Romi Satria Wahono
SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
S1, S2, and S3 (on leave), Department of Computer Sciences, Saitama University, Japan (1994-2004)
Research interests: Software Engineering and Intelligent Systems
Founder of IlmuKomputer.Com
Researcher at LIPI (2004-2007)
Founder and CEO of PT Brainmatics Cipta Informatika

Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research

Methods and Algorithms
- Inferring rudimentary rules
- Statistical modeling
- Constructing decision trees
- Constructing rules
- Association rule learning
- Linear models
- Instance-based learning
- Clustering

Simplicity First
Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
- One attribute does all the work
- All attributes contribute equally and independently
- A weighted linear combination might do
- Instance-based: use a few prototypes
- Use simple logical rules
The success of a method depends on the domain.

Inferring Rudimentary Rules
1R: learns a 1-level decision tree, i.e., a set of rules that all test one particular attribute.
Basic version:
- One branch for each value
- Each branch assigns the most frequent class
- Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
- Choose the attribute with the lowest error rate
(assumes nominal attributes)

Pseudo-code for 1R
Note: "missing" is treated as a separate attribute value.

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

(A Java sketch of this procedure appears after the 1R discussion below.)

Evaluating the weather attributes:

Attribute    Rules              Errors   Total errors
Outlook      Sunny -> No        2/5      4/14
             Overcast -> Yes    0/4
             Rainy -> Yes       2/5
Temp         Hot -> No*         2/4      5/14
             Mild -> Yes        2/6
             Cool -> Yes        1/4
Humidity     High -> No         3/7      4/14
             Normal -> Yes      1/7
Windy        False -> Yes       2/8      5/14
             True -> No*        3/6
(* indicates a tie, broken arbitrarily)

The weather data:

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Dealing with Numeric Attributes
Discretize numeric attributes:
- Divide each attribute's range into intervals
- Sort instances according to the attribute's values
- Place breakpoints where the class changes (majority class)
- This minimizes the total error
Example: temperature from the weather data

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

The weather data with numeric attributes (first few instances):

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
...

The Problem of Overfitting
- This procedure is very sensitive to noise: one instance with an incorrect class label will probably produce a separate interval
- Also: a time-stamp attribute will have zero errors
- Simple solution: enforce a minimum number of instances in the majority class per interval
Example (with min = 3):

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

With overfitting avoidance (min = 3):

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No

Resulting rule set:

Attribute     Rules                       Errors   Total errors
Outlook       Sunny -> No                 2/5      4/14
              Overcast -> Yes             0/4
              Rainy -> Yes                2/5
Temperature   <= 77.5 -> Yes              3/10     5/14
              > 77.5 -> No*               2/4
Humidity      <= 82.5 -> Yes              1/7      3/14
              > 82.5 and <= 95.5 -> No    2/6
              > 95.5 -> Yes               0/1
Windy         False -> Yes                2/8      5/14
              True -> No*                 3/6

Discussion of 1R
- 1R was described in a paper by Holte (1993)
- The paper contains an experimental evaluation on 16 datasets (using cross-validation, so that results were representative of performance on future data)
- The minimum number of instances was set to 6 after some experimentation
- 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!
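As referenced above, here is a minimal Java sketch of the basic 1R procedure for nominal attributes. It is an illustrative reconstruction, not Holte's code: the class name OneR and the String[][]/String[] data layout are assumptions made for this example.

import java.util.*;

public class OneR {

    // Learn one rule set for a single attribute: for each attribute value,
    // assign the most frequent class among instances having that value.
    public static Map<String, String> learn(String[][] data, String[] labels, int attr) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            counts.computeIfAbsent(data[i][attr], v -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        Map<String, String> rule = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            rule.put(e.getKey(),
                     Collections.max(e.getValue().entrySet(),
                                     Map.Entry.comparingByValue()).getKey());
        }
        return rule;
    }

    // Error count: instances that don't belong to the majority class of their branch.
    public static int errors(Map<String, String> rule, String[][] data, String[] labels, int attr) {
        int errs = 0;
        for (int i = 0; i < data.length; i++)
            if (!rule.get(data[i][attr]).equals(labels[i])) errs++;
        return errs;
    }

    // Choose the attribute whose rule set has the smallest error rate.
    public static int bestAttribute(String[][] data, String[] labels) {
        int best = 0, bestErrs = Integer.MAX_VALUE;
        for (int a = 0; a < data[0].length; a++) {
            int errs = errors(learn(data, labels, a), data, labels, a);
            if (errs < bestErrs) { bestErrs = errs; best = a; }
        }
        return best;
    }
}

On the weather data above, with Outlook as the first column, bestAttribute returns Outlook (4/14 errors, tied with Humidity; 1R breaks such ties arbitrarily).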

Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
Robert C. Holte, Computer Science Department, University of Ottawa

Discussion of 1R: Hyperpipes
- Another simple technique: build one rule for each class
- Each rule is a conjunction of tests, one for each attribute
- For numeric attributes: the test checks whether the instance's value is inside an interval, given by the minimum and maximum observed in the training data
- For nominal attributes: the test checks whether the value is one of a subset of attribute values, given by all values observed in the training data
- The class with the most matching tests is predicted (see the Java sketch below)
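A minimal Java sketch of hyperpipes in the same spirit, restricted to nominal attributes for brevity (numeric attributes would instead track a [min, max] interval per class). The class name HyperPipes and the data layout are assumptions, not code from the slides.

import java.util.*;

public class HyperPipes {

    // perClass.get(c).get(a) = set of values of attribute a observed with class c
    private final Map<String, List<Set<String>>> perClass = new HashMap<>();
    private final int numAttrs;

    public HyperPipes(String[][] data, String[] labels) {
        numAttrs = data[0].length;
        for (int i = 0; i < data.length; i++) {
            List<Set<String>> sets = perClass.computeIfAbsent(labels[i], c -> {
                List<Set<String>> s = new ArrayList<>();
                for (int a = 0; a < numAttrs; a++) s.add(new HashSet<>());
                return s;
            });
            for (int a = 0; a < numAttrs; a++) sets.get(a).add(data[i][a]);
        }
    }

    // Predict the class whose rule matches the most attribute tests.
    public String predict(String[] instance) {
        String best = null;
        int bestMatches = -1;
        for (Map.Entry<String, List<Set<String>>> e : perClass.entrySet()) {
            int matches = 0;
            for (int a = 0; a < numAttrs; a++)
                if (e.getValue().get(a).contains(instance[a])) matches++;
            if (matches > bestMatches) { bestMatches = matches; best = e.getKey(); }
        }
        return best;
    }
}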

Statistical Modeling
- Opposite of 1R: use all the attributes
- Two assumptions: attributes are (1) equally important and (2) statistically independent, given the class value
- I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
- The independence assumption is never correct!
- But this scheme works well in practice
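In symbols (a standard formulation, not reproduced from the slides): if the evidence E splits into attribute values E_1, ..., E_n, the independence assumption states

\[ \Pr[E \mid H] = \Pr[E_1 \mid H] \, \Pr[E_2 \mid H] \cdots \Pr[E_n \mid H] \]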

Probabilities for the weather data (counts, with the corresponding fractions of the 9 yes and 5 no instances):

Outlook:      Sunny: Yes 2 (2/9), No 3 (3/5); Overcast: Yes 4 (4/9), No 0 (0/5); Rainy: Yes 3 (3/9), No 2 (2/5)
Temperature:  Hot: Yes 2 (2/9), No 2 (2/5); Mild: Yes 4 (4/9), No 2 (2/5); Cool: Yes 3 (3/9), No 1 (1/5)
Humidity:     High: Yes 3 (3/9), No 4 (4/5); Normal: Yes 6 (6/9), No 1 (1/5)
Windy:        False: Yes 6 (6/9), No 2 (2/5); True: Yes 3 (3/9), No 3 (3/5)
Play:         Yes 9 (9/14), No 5 (5/14)

A new day:

Outlook   Temp   Humidity   Windy   Play
Sunny     Cool   High       True    ?

Likelihood of the two classes:
For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
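A few lines of Java that reproduce the computation above; the class name WeatherExample is ours, and the probabilities are read straight off the table:

public class WeatherExample {
    public static void main(String[] args) {
        // Class-conditional probabilities of Sunny, Cool, High, True, times the class prior
        double likeYes = 2.0/9 * 3.0/9 * 3.0/9 * 3.0/9 * 9.0/14;  // = 0.0053
        double likeNo  = 3.0/5 * 1.0/5 * 4.0/5 * 3.0/5 * 5.0/14;  // = 0.0206
        // Normalize so the two class probabilities sum to 1
        double pYes = likeYes / (likeYes + likeNo);               // = 0.205
        double pNo  = likeNo  / (likeYes + likeNo);               // = 0.795
        System.out.printf("P(yes) = %.3f, P(no) = %.3f%n", pYes, pNo);
    }
}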

Bayes's Rule
Probability of event H given evidence E:

\[ \Pr[H \mid E] = \frac{\Pr[E \mid H] \, \Pr[H]}{\Pr[E]} \]

- A priori probability of H, Pr[H]: probability of the event before evidence is seen
- A posteriori probability of H, Pr[H | E]: probability of the event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England

Naïve Bayes for Classification
- Classification learning: what's the probability of the class given an instance?
- Evidence E = instance
- Event H = class value for the instance
- Naïve assumption: the evidence splits into parts (i.e., attributes) that are independent
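Combining Bayes's rule with the naïve assumption gives the classification formula (standard Naïve Bayes, stated here for completeness):

\[ \Pr[H \mid E] = \frac{\Pr[E_1 \mid H] \, \Pr[E_2 \mid H] \cdots \Pr[E_n \mid H] \, \Pr[H]}{\Pr[E]} \]

Since Pr[E] is the same for every class, it can be ignored when comparing classes and recovered by normalization, as in the worked example above.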

Weather Data Example

Outlook   Temp   Humidity   Windy   Play
Sunny     Cool   High       True    ?

Evidence E is the instance above. Probability of class yes:

\[ \Pr[\text{yes} \mid E] = \frac{\Pr[\text{Outlook=Sunny} \mid \text{yes}] \times \Pr[\text{Temp=Cool} \mid \text{yes}] \times \Pr[\text{Humidity=High} \mid \text{yes}] \times \Pr[\text{Windy=True} \mid \text{yes}] \times \Pr[\text{yes}]}{\Pr[E]} \]

Cognitive Assignment II
- Understand and master one data mining method from the literature
- Summarize it in detail in slide form: the definition, the development of the method, the steps of the algorithm, its application to a case study, and Java code you wrote or found (executing the example case)
- Present it to the class at the next session, in plain language

Choice of Algorithms or Methods
- Neural Network
- Support Vector Machine
- Naive Bayes
- K-Nearest Neighbor
- CART
- Linear Discriminant Analysis
- Agglomerative Clustering
- Support Vector Regression
- Expectation Maximization
- C4.5
- K-Means

- Self-Organizing Map
- FP-Growth
- Apriori
- Logistic Regression
- Random Forest
- K-Medoids
- Radial Basis Function
- Fuzzy C-Means
- K*
- Support Vector Clustering
- OneR

References
- Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
- Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
- Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Elsevier, 2006
- Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
- Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007