
WEKA: A MODERN APPLICATION OF DATA MINING TECHNIQUES

SEAN, ROB, PRATIK, RHODRI, AL, VASANTI, MINGHAO

What is WEKA?
• Desktop application for machine learning & data mining

• Open-source, Java-based tool

• Offers commonly used algorithms to model data.

• Developed at the University of Waikato, New Zealand

What is Data Mining & Machine Learning?
• Data Mining:

• Searching for patterns in data
• Finding value in data

• Machine Learning:
• Developing models that computational resources can use
• Using computational resources to model data and predict a likely outcome

Features of WEKA
• Pre-process data

• Classification & Clustering

• Association rules

• 3D visualisation

Choosing the Dataset
• Public datasets:

• data.gov.uk
• kaggle.com (e.g. the Titanic dataset)
• UCI Machine Learning Repository

• Dataset which could provide insight to a real world scenario

• A dataset that would model effectively in WEKA, with several useful attributes

Capital Bikeshare

Picture: Alejandro Castro, flickr, creative commons

• Bike-share system in Washington DC and surrounding area

• https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

The Objective

• Investigate factors affecting bike-share usage

• Could this data be used to predict how busy or quiet a bike share system may be on a given day?

Dataset fields
• Record index

• Time information
• Date, day of the week, whether the day is a holiday, whether it is a working day, month, year, season (1-4: spring/summer/autumn/winter)

• Weather information
• Weather description (four distinct categories, ranging roughly from good to bad)

• Normalized values for temperature, ‘feels like’ temperature, humidity and windspeed

• Totals
• Counts of bikes rented by registered and casual users
• Total count of bikes rented that day

Pre-processing
• Remove fields which don’t help prediction

• Indexes, sub-totals, etc.

• Filters (see the sketch below):
• Discretize - categorises numeric values into discrete ranges
• ClassBalancer - re-weights instances so the classes are more evenly represented
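A minimal sketch of these pre-processing steps in Weka's Java API (the file name and attribute indices are assumptions for illustration; the same steps are available in the Explorer's Preprocess tab):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.supervised.instance.ClassBalancer;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF converted from the UCI day.csv file
        Instances data = new DataSource("day.arff").getDataSet();

        // Drop the record index and the casual/registered sub-totals
        // (1-based attribute indices, assumed here)
        Remove remove = new Remove();
        remove.setAttributeIndices("1,14,15");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Discretize the numeric rental count into bins and use it as the class
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("last");
        discretize.setBins(4);
        discretize.setInputFormat(data);
        data = Filter.useFilter(data, discretize);
        data.setClassIndex(data.numAttributes() - 1);

        // Re-weight instances so each class carries equal total weight
        ClassBalancer balancer = new ClassBalancer();
        balancer.setInputFormat(data);
        data = Filter.useFilter(data, balancer);

        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes after filtering");
    }
}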

Data Visualisation

Basic terminology to understand evaluation of classifiers
• True positive (tp): an instance is correctly predicted to belong to the given class
• True negative (tn): an instance is correctly predicted not to belong to the given class
• False positive (fp): an instance is incorrectly predicted to belong to the given class
• False negative (fn): an instance is incorrectly predicted not to belong to the given class

Explanation of Statistics

• Precision: tp / (tp + fp)

• Recall: tp / (tp + fn)

• F-measure: 2 × precision × recall / (precision + recall)
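As an illustrative, made-up example: with tp = 40, fp = 10 and fn = 20, precision = 40 / 50 = 0.80, recall = 40 / 60 ≈ 0.67, and F-measure = 2 × 0.80 × 0.67 / (0.80 + 0.67) ≈ 0.73. Weka reports these figures per class in its "Detailed Accuracy By Class" output.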

Algorithms explored
Graph based:

• J48 - This classifier uses a tree structure to make decisions.

• Performs very well on our dataset
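A sketch of building and inspecting a J48 tree in the Java API (assumes the pre-processed file from the earlier sketch, with the discretized rental count as the class attribute):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        // Hypothetical pre-processed ARFF with the class as the last attribute
        Instances data = new DataSource("day-preprocessed.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();               // Weka's C4.5-style decision tree learner
        tree.setConfidenceFactor(0.25f);    // default pruning confidence
        tree.buildClassifier(data);
        System.out.println(tree);           // prints the learned tree in text form
    }
}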

Algorithms explored
Rule based:

• ZeroR - the simplest classification method: it relies only on the target (class) distribution and ignores all predictors.

• Performs poorly on our dataset; useful mainly as a baseline

Algorithms explored
Naïve Bayes:
• A probabilistic classifier based on Bayes' Theorem which analyses the relationship between features and class labels
• Can handle missing values by ignoring them when calculating the conditional probabilities
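A short sketch of training Naïve Bayes on the same hypothetical pre-processed file; swapping classifiers in Weka code is essentially a one-line change:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-preprocessed.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);   // estimates per-class attribute distributions
        System.out.println(nb);     // prints the learned probability tables
    }
}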

Test Set Division
Training and testing set:
- Training data is used for building an ML model
- Testing data is used for measuring the performance of an ML model

Supplying a testing set in Weka
• Separate training and testing sets
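A sketch of evaluation with a separately supplied test set (file names are placeholders; the test set must share the training set's attribute structure):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("bike-train.arff").getDataSet();  // assumed files
        Instances test  = new DataSource("bike-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);        // learn only from the training split

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);     // measure only on the unseen test split
        System.out.println(eval.toSummaryString());
    }
}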

Test Set Division
Cross-validation:
- Helps overcome the problem of overfitting
- Makes the predictions more general
• Involves:
- Splitting the original dataset into k equal parts (folds)
- Setting one fold aside, training on the remaining k-1 folds and measuring performance on the held-out fold
- Repeating the process k times, taking a different fold each time
• 10-fold cross-validation: k = 10
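In code, 10-fold cross-validation is a single call on an Evaluation object; a sketch (Weka handles the fold splitting and repetition internally, file name assumed as before):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-preprocessed.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // k = 10: train on 9 folds, test on the held-out fold, repeat 10 times
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());  // precision, recall, F-measure per class
    }
}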

Test Set Division
Percentage split:
- Randomly splits the dataset into a training partition and a testing partition each time you evaluate a model

For example: with a dataset of 100 instances, a 66% percentage split uses 66 instances for training and the remaining 34 as the test set.
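A sketch of a 66% / 34% percentage split in code (the randomize step mirrors what the Explorer does before splitting; file name assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-preprocessed.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));                    // shuffle before splitting

        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        int testSize  = data.numInstances() - trainSize;
        Instances train = new Instances(data, 0, trainSize);         // first 66%
        Instances test  = new Instances(data, trainSize, testSize);  // remaining 34%

        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.pctCorrect() + "% correct on the 34% test split");
    }
}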

What is Clustering?
• Finding the class labels and the number of classes directly from the data (in contrast to classification)

• It is unsupervised learning: we want to explore the data to find structure in it

What is clustering for?
• Grouping items with similar properties together into clusters
• For example, to apply machine learning approaches to make decisions based on data, e.g. classifying T-shirts as “small”, “medium” and “large”

Clustering types:


Some popular clustering algorithms
• K-means clustering (disjoint sets)
• EM clustering (probabilistic)
• Cobweb clustering (hierarchical)

KMeans: Iterative distance-based clustering (disjoint sets)
1. Specify k, the desired number of clusters
2. Choose k points at random as cluster centers
3. Assign all instances to their closest cluster center
4. Calculate the centroid (i.e. mean) of the instances in each cluster
5. These centroids become the new cluster centers
6. Continue until the cluster centers don’t change
Minimizes the total squared distance from instances to their cluster centers.

K-means in Weka
• Note the parameters:
• numClusters
• distanceFunction
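A sketch of SimpleKMeans in the Java API with those two parameters set (k = 4 to match the seasons experiment below; Euclidean distance is already the default, and the file name is an assumption for data with no class attribute):

import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterBikes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-clustering.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);                              // numClusters parameter
        kmeans.setDistanceFunction(new EuclideanDistance());   // distanceFunction parameter
        kmeans.buildClusterer(data);

        System.out.println(kmeans);   // prints centroids and cluster sizes
    }
}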

How can we tell the right number of clusters?
• In general, this is an unsolved problem

• Clustering is subjective

• Use the AddCluster unsupervised attribute filter (see the sketch below)

• Hard to evaluate clustering
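A sketch of the AddCluster filter, which appends each instance's cluster assignment as a new attribute so the clusters can be inspected alongside the original attributes (file name assumed):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-clustering.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(kmeans);        // the clusterer is built inside the filter
        addCluster.setInputFormat(data);
        Instances withClusters = Filter.useFilter(data, addCluster);

        // The last attribute now holds the cluster label for each day
        System.out.println(withClusters.attribute(withClusters.numAttributes() - 1));
    }
}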

Trying to cluster into seasons
Using K-means clustering with k = 4, we wish to see whether the data falls into clusters based on the seasons (a sketch of this follows below).
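A sketch of the standard "classes to clusters" pattern for this experiment: the season attribute is hidden from the clusterer, then used afterwards to see how well the four clusters line up with the four seasons (the attribute index and file name are assumptions):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class SeasonClusters {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("day-preprocessed.arff").getDataSet();
        data.setClassIndex(1);   // assume 'season' is the second attribute

        // Cluster on a copy with the season attribute removed
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);            // one cluster per season, ideally
        kmeans.buildClusterer(clusterData);

        // Compare cluster assignments against the held-back season labels
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}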

Observations
• We found that the winter and summer months have separated into two distinct clusters.
• The autumn and spring months have not separated so well.

• From the visualisation we also see the overall trend of more users in the summer months compared to winter ones.

• This is not surprising since these months are hotter and people are more likely to choose to rent bikes.

Possible Improvements
• Data accuracy

• Uncontrollable outside factors, e.g. road closures, new cycle paths, tube strikes, etc.
• As popularity increases, results may be affected

• Data precision
• Bad measurements and subjective readings (e.g. weather) are generalised, where exact measurements are needed
• Variable factors, e.g. temperature or weather differ depending on the exact location

• The data itself is always changing: it is only an indicator of some relationships
• Different people, e.g. tourists, may have different attitudes
• Different locations yield different results: weather varies across continents

Evaluation of best approach
• J48 - easy to visualise

• ZeroR is a bad idea for our dataset

Overall: the best approach is to try several different WEKA modules and compare the results, to focus efforts and find the best solution.

• Graphs of properties can indicate the most important factors for classification

• Classification algorithms: used to build a model
• Testing the model is also crucial

Conclusions based on data
• Dataset suitability - probably more suited to classification than clustering

• Some prediction was possible
• External factors - other changes in the transport network, cycling for health, city events

• Other possible analysis: usage by hour, casual users

• Applications: Smart cities & planning - effective bikeshare provision