Data mining methodology in Weka Renata Benda Prokeinova Department of Statistics and Operation research


Logistic regression - theory

Logistic regression can in many ways be seen as similar to ordinary regression. It models the relationship between a dependent variable and one or more independent variables, and it lets us examine both the fit of the model and the significance of the relationships (between dependent and independent variables) that we are modelling.

However, the underlying principle of binomial logistic regression, and its statistical calculation, are quite different from those of ordinary linear regression. Ordinary regression uses ordinary least squares to find a best-fitting line and produces coefficients that predict the change in the dependent variable for a one-unit change in the independent variable. Logistic regression instead estimates the probability of an event occurring.

What we want to predict from knowledge of the relevant independent variables is not a precise numerical value of the dependent variable, but rather the probability (p) that it is 1 (event occurring) rather than 0 (event not occurring). This means that, while in linear regression the relationship between the dependent and the independent variables is linear, this assumption is not made in logistic regression.

Logistic regression - output

First section of the report:

Coefficients...
                        Class
Variable                  yes
=============================
outlook=sunny         -6.4257
outlook=overcast      13.5922
outlook=rainy         -5.6562
temperature           -0.0776
humidity              -0.1556
windy                  3.7317
Intercept              22.234

The coefficients are in fact the weights that are applied to each attribute before adding them together. The weighted sum itself is not the prediction, however: it is passed through the logistic function, and the result is the probability that the new instance belongs to class yes (> 0.5 means yes).

Logistic regression - output

Odds Ratios...
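The link between the coefficients and the odds ratios in this report can be checked by hand: each odds ratio is the exponential of the matching coefficient, and the class probability is the logistic function of the weighted sum. The following is a minimal Python sketch of that arithmetic, not Weka's own code. The names `probability_yes` and `odds_ratios`, the 0/1 indicator encoding of the nominal attributes, and the example instance are assumptions made for illustration only.

```python
import math

# Coefficients copied from the Weka report for class "yes".
COEFFICIENTS = {
    "outlook=sunny": -6.4257,
    "outlook=overcast": 13.5922,
    "outlook=rainy": -5.6562,
    "temperature": -0.0776,
    "humidity": -0.1556,
    "windy": 3.7317,
}
INTERCEPT = 22.234


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative scores
    return e / (1.0 + e)


def probability_yes(instance):
    """Weighted sum of attribute values plus intercept, then the logistic function."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in instance.items())
    return sigmoid(z)


# Odds ratio of each variable = exp(coefficient).
odds_ratios = {name: math.exp(coef) for name, coef in COEFFICIENTS.items()}

# Hypothetical instance (assumption): nominal values encoded as 0/1 indicators,
# with windy = 1 standing for whichever value Weka assigns the coefficient to.
example = {
    "outlook=sunny": 1,
    "outlook=overcast": 0,
    "outlook=rainy": 0,
    "temperature": 75,
    "humidity": 70,
    "windy": 1,
}

p = probability_yes(example)
print("P(yes) = %.3f -> predicted class: %s" % (p, "yes" if p > 0.5 else "no"))
for name, ratio in odds_ratios.items():
    print("%-18s %.4f" % (name, ratio))
```

Because the printed coefficients are rounded to four decimals, the recomputed odds ratios agree with the report only up to small rounding differences.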
                        Class
Variable                  yes
=============================
outlook=sunny          0.0016
outlook=overcast  799848.4264
outlook=rainy          0.0035
temperature            0.9254
humidity               0.8559
windy                 41.7508

The odds ratios indicate how large an influence a change in that value (or a change to that value) will have on the prediction. I think this link does a great job explaining the odds ratios. The value of outlook=overcast is so large because, if the outlook is overcast, the odds are very good that play will equal yes.

Logistic regression - output

=== Confusion Matrix ===

 a b