
Fundamentals of Predictive Analytics
A Business Analytics Course

University of the Philippines Open University

Course Writers:
Dr. Eugene Rex Jalao
Dr. Melinda Lumanta
Ms. Louise Antonette Villanueva
Asst. Prof. Reinald Adrian Pugoy
Asst. Prof. Mar Cruz

University of the Philippines
OPEN UNIVERSITY

COMMISSION ON HIGHER EDUCATION


UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY

Fundamentals of Predictive Analytics

A Business Analytics Course

Welcome, dear students! This course will help you traverse the world of predictive analytics. In predictive analytics (also sometimes called data mining), useful patterns can be extracted from available data, which can, in turn, be utilized to predict the future. Moreover, predictive analytics draws ideas from various fields such as machine learning, pattern recognition, statistics, and database systems.

Before taking this course, you should have already completed the Fundamentals of Descriptive Analytics course. And one more thing: this very document is your course guide. Kindly read it carefully before embarking on your journey to learn the Fundamentals of Predictive Analytics.

COURSE OBJECTIVES

At the end of the course, you should be able to:

1. Discuss key concepts, frameworks, and algorithms in predictive analytics.
2. Identify appropriate predictive analytics/data mining methodologies for different business problems.
3. Utilize appropriate predictive analytics/data mining methodologies to build and validate models.
4. Evaluate the performance of the predictive model.

COURSE OUTLINE

MODULE 1: Introduction to Predictive Analytics
1. Definition of Predictive Analytics and Data Mining
2. Predictive Analytics Framework

MODULE 2: Data Preprocessing
1. Definition of Data Preprocessing
2. Tasks of Data Preprocessing

MODULE 3: Supervised Learning
1. Classification Methodologies
2. Regression Methodologies
3. Evaluation and Deployment

MODULE 4: Unsupervised Learning
1. Association Rule Mining
2. Sequential Pattern Mining
3. Clustering
4. Text Mining
5. Social Media Sentiment Analysis

COURSE MATERIALS

The course learning package consists of the items listed below. These will be made available for access and download.

1. Course Guide
2. Study Guides
3. Video Lectures/Resources
4. Other Digital References

STUDY SCHEDULE

Weeks 1-2: Course Overview; Module 1: Introduction to Predictive Analytics
• Read the Course Guide.
• Introduce yourself via the Self-Introduction forum.
• Go through the Module 1 Study Guide.
• Watch the following video resources:
  o Introduction to Predictive Analytics (ERL Jalao)
  o Supervised Learning vs Unsupervised Learning (RADL Pugoy)
  o Tools of Data Mining (ERL Jalao)
• Participate in Discussion Forum 1.

Weeks 2-3: Module 2: Data Preprocessing
• Go through the Module 2 Study Guide.
• Watch the following video resources:
  o Data Preprocessing (ERL Jalao)
  o Data Integration (ERL Jalao)
  o Data Transformation (ERL Jalao)
  o Data Encoding (ERL Jalao)
  o Data Cleaning (ERL Jalao)
  o Data Reduction and Manipulation (ERL Jalao)
• Participate in Discussion Forum 2.

Weeks 4-7: Module 3: Supervised Learning (Classification)
• Go through the Module 3 Study Guide (pp. 1-3).
• Watch the following video resources:
  o Classification (ERL Jalao)
  o Naive Bayes (ERL Jalao)
  o Decision Trees (ERL Jalao)
  o Nearest Neighbours (ERL Jalao)
  o Artificial Neural Networks (ERL Jalao)
  o Support Vector Machines (ERL Jalao)
  o Ensembles (ERL Jalao)
  o Random Forests (ERL Jalao)
  o Model Evaluation (ERL Jalao)
• Submit Assignment 1.

Weeks 8-9: Module 3: Supervised Learning (Regression)
• Go through the Module 3 Study Guide (pp. 3-5).
• Watch the following video resources:
  o Regression (ERL Jalao)
  o Regression Model Evaluation (ERL Jalao)
  o Indicator Variables (ERL Jalao)
  o Multicollinearity (ERL Jalao)
  o Logistic Regression (ERL Jalao)
• Submit Assignment 2.

Weeks 10-12: Module 4: Unsupervised Learning
• Go through the Module 4 Study Guide.
• Watch the following video resources:
  o Association Rule Mining (ERL Jalao)
  o Sequential Pattern Mining (ERL Jalao)
  o K-Means Clustering (ERL Jalao)
  o Hierarchical Clustering (ERL Jalao)
  o Text Mining (ERL Jalao)
  o Social Media Sentiment Analysis (ERL Jalao)
• Submit Assignment 3.

Weeks 13-14: Review and Integration
• Participate in Discussion Forum 3.

Week 15: Final Exam
• Accomplish the Final Exam.

COURSE REQUIREMENTS

To earn a digital certificate of accomplishment, you need to satisfactorily complete the following:

1. Three (3) discussion forums (20%)

Discussion forums or DFs are venues for you to share your thoughts and learnings and show comprehension of the subject matter at hand. Guide questions shall be posted for each DF. Each of you is expected to provide insightful answers, with properly cited references.

2. Three (3) assignments (60%)

The goal of the assignments is for you to actually apply learned methodologies and principles to the problem or issue at hand. An assignment may be a position paper, a modelling exercise, or a programming assignment.

3. One (1) final exam (20%)

The final exam shall serve as your summative assessment, i.e., to evaluate everything that you have learned all throughout this course.


MODULE 1: INTRODUCTION TO PREDICTIVE ANALYTICS

Introduction

This is the first module in the course. As such, it gives an overview of what the students will be learning in this course as a whole, i.e., predictive analytics. The principles pertaining to predictive analytics are briefly defined and discussed.

Learning Objectives

After working on this module, you should be able to:

1. Define what predictive analytics is.
2. Discuss the fundamental ideas, concepts, and techniques associated with predictive analytics.
3. Describe the Predictive Analytics Framework.

1.1. What is predictive analytics?

Predictive analytics (also sometimes called data mining) is the non-trivial extraction of implicit, previously unknown, and potentially useful information from the data. In other words, useful patterns are extracted from the data, and we hope that these patterns will be repeated in the future. Another definition of data mining is that it is the exploration and analysis of a large quantity of data to discover meaningful patterns by automatic or semi-automatic means. Data mining is all about explaining the past to predict the future. Predictive analytics draws ideas from various fields such as machine learning, pattern recognition, statistics, and database systems.


Learning Resources

• Video on “Introduction to Predictive Analytics” by Dr. Eugene Rex Jalao.
• Video on “Predictive Analytics” by Mr. Dominic Ligot.

Discussion Forum 1

Discuss the potential applications of predictive analytics/data mining in your field of work.

1.2. Supervised Learning vs Unsupervised Learning

Before proceeding to the latter parts of the course, it is important to differentiate supervised learning from unsupervised learning. Supervised learning refers to the task of inferring a function from supervised (or labeled) training data. Each example in the training data is a pair consisting of the input object and its corresponding label. On the other hand, unsupervised learning pertains to the problem of trying to find hidden structures in unlabeled data. Since the examples are unlabeled, it can be more challenging to find and evaluate a potential solution.
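To make the contrast concrete, here is a minimal Python sketch (assuming the scikit-learn library; the toy data and labels are made up for illustration). A supervised learner is fit on inputs together with their labels, while an unsupervised learner is fit on the inputs alone:

    # Supervised: each training example is an (input, label) pair.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X = [[25, 40000], [47, 90000], [33, 62000], [52, 110000]]  # inputs (e.g., age, income)
    y = ["no", "yes", "no", "yes"]                             # labels

    clf = DecisionTreeClassifier().fit(X, y)  # infers a function from inputs to labels
    print(clf.predict([[40, 75000]]))         # predicts a label for an unseen input

    # Unsupervised: only the inputs are given; the algorithm looks for hidden structure.
    km = KMeans(n_clusters=2, n_init=10).fit(X)  # no labels supplied
    print(km.labels_)                            # the group assignments it discovered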

Learning Resources

Video on “Supervised Learning vs Unsupervised Learning” by Asst. Prof. Reinald Adrian Pugoy.

Study Questions

1. What distinguishes supervised learning from unsupervised learning?

2. Why can it be more challenging to find and evaluate a solution in unsupervised learning?


1.3. Tools of Data Mining

Data mining tools are software usually downloaded or bought from third-party providers. An example of such a tool is R, an open-source software package that is considered to be the most utilized tool for data mining and predictive analytics. Other tools include SPSS, RapidMiner, SAS, Excel, and Python, with Python coming closer to R nowadays. Furthermore, it is important to note that no single tool solves all predictive analytics problems. In other words, solutions for such problems cannot be implemented using just one software package. For this reason, a majority of companies worldwide use both free/open-source software and commercial software.

Learning Resources

Video on “Tools of Data Mining” by Dr. Eugene Rex Jalao.

1.4. The Predictive Analytics Framework

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a non-proprietary framework that allows us to implement predictive analytics solutions. It is a necessary standard process that ensures that data mining is reliable and can be easily repeated by people with little to no data mining background. It also demonstrates the maturity of data mining and reduces dependency on experts. Furthermore, the CRISP-DM Model serves as an aid to project planning and management, and it is also considered a “comfort factor” for new adopters.

The CRISP-DM Model involves several phases. These are:

1. Business Understanding - The focus is on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.


2. Data Understanding - It starts with initial data collection and familiarization. Data quality problems are identified to discover first insights into the data and to detect interesting subsets.

3. Data Preparation - This covers all pre-processing activities to construct the final dataset from the raw data. This process then allows the data to fit into the model that will be used to solve a particular problem.

4. Modeling - This phase defines the modeling techniques, applications, and parameters to calibrate specific models to solve the problem.

5. Evaluation - It identifies and verifies whether the output of the generated model has solved the problem. Otherwise, if the model fails to solve the problem, there is a need to go back to the previous phases as the model has to be redefined.

6. Deployment - After validating the model, it is then deployed so that the end-users of the organization will be able to finally utilize and reap the benefits of the data mining solution.

Figure 1. Generic tasks and output of the CRISP-DM Model.

Learning Resources

Video on “CRISP-DM (Cross-Industry Standard Process for Data Mining)” by Dr. Eugene Rex Jalao.


Assignment 0/Discussion Forum 2

Apply CRISP-DM to the case study to be given by your professor/faculty-in-charge. Share your experience by posting it in our discussion forum.

Study Question

Why is it important for managers to know how each phase of the CRISP-DM Model is carried out?


MODULE 2: DATA PRE-PROCESSING

Introduction

This module intends to:

1. Introduce basic concepts in data pre-processing; and

2. Introduce methods of data pre-processing.

2.1. Basic Concepts in Data Pre-processing

Data preprocessing is an important step in data analytics. It aims at assessing and improving the quality of data for secondary statistical analysis. With this, the data is better understood and the data analysis is performed more accurately and efficiently.

Learning Objectives

After studying the basic concepts in data preprocessing, you should be able to:

1. Explain what data preprocessing is and why it is important in data analytics; and
2. Describe different forms of data preprocessing.

A. What is Data Pre-Processing?

Data in the real world tend to be incomplete, noisy, and inconsistent. “Dirty” data can lead to errors in parameter estimation and incorrect analysis, leading users to draw false conclusions. Quality decisions must be based on quality data; hence, unclean data may cause incorrect or even misleading statistical results and predictive analyses. Data preprocessing is a data mining technique that involves transforming raw or source data into an understandable format for further processing.


B. Tasks for Data Pre-processing

Several distinct steps are involved in preprocessing data. Here are the general steps taken to pre-process data:

• Data cleaning
  o This step deals with missing data, noise, outliers, and duplicate or incorrect records while minimizing the introduction of bias into the database.
  o Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.

• Data integration
  o Extracted raw data can come from heterogeneous sources or be in separate datasets. This step reorganizes the various raw datasets into a single dataset that contains all the information required for the desired statistical analyses.
  o It involves integration of multiple databases, data cubes, or files.
  o Data with different representations are put together, and conflicts within the data are resolved.

• Data transformation
  o This step translates and/or scales variables stored in a variety of formats or units in the raw data into formats or units that are more useful for the statistical methods that the researcher wants to use.
  o Data is normalized, aggregated, and generalized.

• Data reduction
  o After the dataset has been integrated and transformed, this step removes redundant records and variables, as well as reorganizes the data in an efficient and “tidy” manner for analysis.
  o It pertains to obtaining a reduced representation in volume that produces the same or similar analytical results.
  o This step aims to present a reduced representation of the data in a data warehouse.

Pre-processing is sometimes iterative and may involve repeating this series of steps until the data are satisfactorily organized for the purpose of statistical analysis. During preprocessing, one needs to take care not to accidentally introduce bias by modifying the dataset in ways that will impact the outcome of statistical analyses. Similarly, we must avoid reaching statistically significant results through “trial and error” analyses on differently pre-processed versions of a dataset.

Learning Resources

Dr. Eugene Rex Jalao’s video on Data Preprocessing

Activity 2-1

Watch: Dr. Eugene Rex Jalao’s video on Data Preprocessing.

Methods of Data Pre-processing

Data preprocessing consists of a series of steps to transform data extracted from different data sources into “clean” data prior to statistical analysis. Data pre-processing includes data cleaning, data integration, data transformation, and data reduction.


Learning Objectives

After studying the methods of data preprocessing, you should be able to:

1. Understand the different methods of data preprocessing; and
2. Differentiate the different techniques of data preprocessing.

A. Data Integration

Data integration is the process of combining data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset. In data integration, data from the different sources, as well as the metadata (the data about this data) from the different sources, are integrated to come up with a single data store. There are a number of issues to consider during data integration, related mostly to possibly different standards among data sources. These issues include the entity identification problem, data value conflicts, and redundant data. Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve the speed and quality of the subsequent data mining.

Four Types of Data Integration Methodologies (a pandas sketch of these joins follows the list)

1. Inner Join - creates a new result table by combining column values of two tables (A and B) based upon the join predicate.
2. Left Join - returns all the values from an inner join plus all values in the left table that do not match the right table, including rows with NULL (empty) values in the link column.
3. Right Join - returns all the values from the right table and matched values from the left table (NULL in the case of no matching join predicate).
4. Outer Join - the union of all the left join and right join values.
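To make the four joins concrete, here is a minimal Python sketch using pandas; the two tables and the customer_id key column are made up for illustration:

    import pandas as pd

    A = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Carl"]})
    B = pd.DataFrame({"customer_id": [2, 3, 4], "balance": [500, 120, 310]})

    inner = pd.merge(A, B, on="customer_id", how="inner")  # only keys present in both tables
    left  = pd.merge(A, B, on="customer_id", how="left")   # all of A; NaN (NULL) where B has no match
    right = pd.merge(A, B, on="customer_id", how="right")  # all of B; NaN where A has no match
    outer = pd.merge(A, B, on="customer_id", how="outer")  # union of the left and right joins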


Learning Resources

Dr. Eugene Rex Jalao’s video on Data Integration

Activity 2-1

Watch: Dr. Jalao’s video on Data Integration.


Study Questions

1. How do the inner, left, right, and outer joins differ in the rows they return?

2. Discuss the issues that may arise when integrating data from multiple heterogeneous sources.

B. Data Transformation

Data transformation is the process of transforming data from one format to another. It aims to transform the data values into a format, scale, or unit that is more suitable for analysis. Data transformation is an important step in data preprocessing and a prerequisite for building predictive analytics solutions.

Here are a few common possible options for data transformation (a pandas sketch illustrating them follows the list):

1) Normalization - a way to scale a specific variable to fall within a small specific range.
   a) Min-max normalization - transforming values to a new scale such that all values fall within a standardized range, e.g., 0 to 1.
   b) Z-score standardization - transforming a numerical variable to a standard normal distribution.

2) Encoding and Binning
   a) Binning - the process of transforming numerical variables into categorical counterparts.
      i) Equal-width (distance) partitioning - divides the range into N intervals of equal size, thus forming a uniform grid.
      ii) Equal-depth (frequency) partitioning - divides the range into N intervals, each containing approximately the same number of samples.
   b) Encoding - the process of transforming categorical values into binary or numerical counterparts, e.g., mapping male or female for gender to 1 or 0. Data encoding is needed because some data mining methodologies, such as linear regression, require all data to be numerical.
      i) Binary Encoding (Unsupervised) - transformation of categorical variables into variables taking the values 0 or 1 to indicate the absence or presence of each category. If the categorical variable has k categories, we would need to create k binary variables.
      ii) Class-based Encoding (Supervised)
         • Discrete class: replace the categorical variable with just one new numerical variable, and replace each category of the categorical variable with its corresponding probability of the class variable.
         • Continuous class: replace the categorical variable with just one new numerical variable, and replace each category of the categorical variable with its corresponding average of the class variable.
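The following Python sketch (assuming pandas; all column names and values are made up) illustrates the transformation options above on a toy dataset:

    import pandas as pd

    df = pd.DataFrame({
        "income": [20_000, 35_000, 50_000, 80_000, 120_000],
        "gender": ["M", "F", "F", "M", "F"],
        "city":   ["A", "A", "B", "B", "C"],
        "bought": [0, 1, 1, 0, 1],  # the class variable
    })

    # 1) Normalization
    x = df["income"]
    df["income_minmax"] = (x - x.min()) / (x.max() - x.min())  # min-max: rescales to [0, 1]
    df["income_z"] = (x - x.mean()) / x.std()                  # z-score standardization

    # 2a) Binning
    df["income_eqwidth"] = pd.cut(x, bins=3)  # equal-width: 3 intervals of equal size
    df["income_eqdepth"] = pd.qcut(x, q=3)    # equal-depth: roughly equal samples per bin

    # 2b) Encoding
    df["gender_bin"] = (df["gender"] == "M").astype(int)  # categorical to 1/0
    onehot = pd.get_dummies(df["city"], prefix="city")    # k binary variables for k categories
    df["city_enc"] = df.groupby("city")["bought"].transform("mean")  # class-based encoding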


Learning Resources

• Dr. Eugene Rex Jalao’s video on Data Transformation
• Dr. Eugene Rex Jalao’s video on Data Encoding

C. Data Cleaning

All data sources potentially include errors and missing values; data cleaning addresses these anomalies. Data cleaning is the process of altering data in a given storage resource to make sure that it is accurate and correct. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data, as well as resolve redundancy caused by data integration.


Data Cleaning Tasks (see the sketch after this list):

a) Fill in missing values. Solutions for handling missing data:
   i. Ignore the tuple.
   ii. Fill in the missing value manually.
   iii. Data imputation:
      - Use a global constant to fill in the missing value.
      - Use the attribute mean to fill in the missing value.
      - Use the attribute mean for all samples belonging to the same class.

b) Clean noisy data. Solutions for cleaning noisy data:
   i. Binning - transforming numerical values into categorical counterparts.
   ii. Clustering - grouping data into a corresponding cluster and using the cluster average to represent a value.
   iii. Regression - utilizing a simple regression line to estimate a very erratic data set.
   iv. Combined computer and human inspection - detecting suspicious values and checking them through human intervention.

c) Identify outliers. Solutions for identifying outliers:
   i. Box plot.
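Here is a minimal Python sketch (assuming pandas; the values are made up) of mean imputation and the box-plot (1.5 × IQR) rule for flagging outliers:

    import pandas as pd

    s = pd.Series([12, 15, None, 14, 13, 98])  # one missing value, one suspicious value

    # a) Fill in the missing value with the attribute mean (data imputation)
    s_filled = s.fillna(s.mean())

    # c) Identify outliers with the box-plot rule: flag points beyond 1.5 * IQR
    q1, q3 = s_filled.quantile(0.25), s_filled.quantile(0.75)
    iqr = q3 - q1
    outliers = s_filled[(s_filled < q1 - 1.5 * iqr) | (s_filled > q3 + 1.5 * iqr)]
    print(outliers)  # flags the value 98 for human inspection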

Activity 2-2

Watch: Dr. Jalao’s video on Data Cleaning.

D. Data Reduction and Manipulation

Data reduction is a process of obtaining a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. The need for data reduction emerged from the fact that databases/data warehouses may store terabytes of data, and complex data analysis/mining may take a very long time to run on the complete data set.


Data Reduction Strategies (a sampling sketch follows the list):

a. Sampling - utilizing a smaller representative sample from the big data set or population that will generalize to the entire population.
   Types of sampling:
   1. Simple random sampling - there is an equal probability of selecting any particular item.
   2. Sampling without replacement - as each item is selected, it is removed from the population.
   3. Sampling with replacement - objects are not removed from the population as they are selected for the sample.
   4. Stratified sampling - split the data into several partitions, then draw random samples from each partition.

b. Feature subset selection - reduces the dimensionality of data by eliminating redundant and irrelevant features.
   Feature subset selection techniques:
   1. Brute-force approach - try all possible feature subsets as input to the data mining algorithm.
   2. Embedded approaches - feature selection occurs naturally as part of the data mining algorithm.
   3. Filter approaches - features are selected before the data mining algorithm is run.
   4. Wrapper approaches - use the data mining algorithm as a black box to find the best subset of attributes.

c. Feature creation - creating new attributes that can capture the important information in a data set much more efficiently than the original attributes.
   Feature creation methodologies:
   1. Feature extraction
   2. Mapping data to a new space
   3. Feature construction
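Here is a minimal Python sketch of the sampling strategies (assuming a recent version of pandas; the data and the cls class column are made up):

    import pandas as pd

    df = pd.DataFrame({"x": range(100), "cls": ["a"] * 70 + ["b"] * 30})

    simple = df.sample(n=10, random_state=1)              # simple random, without replacement
    boot = df.sample(n=10, replace=True, random_state=1)  # sampling with replacement
    strat = df.groupby("cls").sample(frac=0.1, random_state=1)  # stratified: 10% per class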

Learning Resources

Dr. Eugene Rex Jalao’s video on Data Reduction and Manipulation.


Activity 2-3

Watch: Dr. Jalao’s video on Data Reduction and Manipulation.

Other References Used for Module 2:

Malik, J.S., Goyal, P., & Sharma, A.K. A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules. IES-IPS Academy, Indore, India. Available at https://bvicam.ac.in/news/INDIACom%202010%20Proceedings/papers/Group3/INDIACom10_279_Paper%20(2).pdf

Son, N.H. (2006). Data Mining Course: Data Cleaning and Data Preprocessing. Warsaw University. Available at http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf

Malley, B., Ramazzotti, D., & Wu, J.T. (2016). Data Pre-processing. In: Secondary Analysis of Electronic Health Records. Springer, Cham. Available at https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2


MODULE 3: SUPERVISED LEARNING

Introduction

In Module 1, you have already encountered the definition of supervised learning. This time, Module 3 discusses supervised learning in greater detail. Specifically, supervised learning methodologies can be categorized into two: classification, the prediction of a class or category from several predictor variables; and regression, the prediction of a numerical value from one or more predictors. This module also tackles how regression and classification models may be evaluated.

Learning Objectives

After working on this module, you should be able to:

• Differentiate classification from regression.
• Identify and discuss appropriate supervised learning methodologies for various scenarios and business problems.
• Build and validate prediction models by utilizing supervised learning methodologies.
• Evaluate the performance of the prediction model.

3.1. Classification

Given a collection of records, let us say that we have multiple predictor variables (x1, x2, ..., xp) and one categorical response (y). Here, we intend to find a model for predicting the class variable from the multiple predictor variables. This is the essence of classification: a categorical response is predicted from multiple predictor variables. In classification, historical data are used to build a model, and the goal is to predict previously unseen records. There are several classification algorithms, and some of these are listed below (a quick-start sketch follows the list):

• ZeroR - the simplest classification methodology; it relies on the target and ignores all predictors.
• OneR - a simple yet accurate classification algorithm that generates one rule for each predictor in the data.
• Naïve Bayes - a frequency-based classifier that uses a probabilistic framework and conditional probability to classify data.
• Decision Tree - builds classification models in the form of a tree structure that represents rules that can be easily understood.
• Nearest Neighbours - an intuitive method that classifies unlabeled data based on their similarity with sample examples in the training set. It utilizes distance as a similarity measure in making predictions.
• Artificial Neural Network (ANN) - a network of perceptrons or nodes that mimics a biological network of neurons in the brain.
• Support Vector Machine (SVM) - performs classification by finding a plane that maximizes the margin between the two classes.
• Ensemble - predicts the class of previously unseen records by aggregating predictions made by multiple classifiers.
• Random Forests - a relatively modern algorithm that is essentially an ensemble of decision trees.
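As a concrete starting point, here is a minimal Python sketch (assuming scikit-learn; the records and labels are made up) that trains one of the classifiers above, a decision tree, on historical data and then predicts a previously unseen record:

    from sklearn.tree import DecisionTreeClassifier

    # Historical (labeled) records: predictors x1, x2 and a categorical response y
    X_train = [[35, 40_000], [52, 95_000], [28, 31_000], [61, 120_000]]
    y_train = ["declined", "approved", "declined", "approved"]

    model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
    print(model.predict([[45, 70_000]]))  # class prediction for an unseen record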

Learning Resources

• Video on “Classification” by Dr. Eugene Rex Jalao.
• Video on “Naive Bayes” by Dr. Eugene Rex Jalao.
• Video on “Decision Trees” by Dr. Eugene Rex Jalao.
• Video on “Nearest Neighbours” by Dr. Eugene Rex Jalao.
• Video on “Artificial Neural Networks” by Dr. Eugene Rex Jalao.
• Video on “Support Vector Machines” by Dr. Eugene Rex Jalao.
• Video on “Ensembles” by Dr. Eugene Rex Jalao.
• Video on “Random Forests” by Dr. Eugene Rex Jalao.


3.1.1. Classification Model Evaluation

We will not know how well a model performs unless model evaluation comes into the picture. Model evaluation is a methodology used to find the model that best represents the data and to determine how well the chosen model will work in the future. Listed below are the questions that need to be answered in model evaluation (a sketch of the basic workflow follows the list):

1. How do we evaluate the performance of a classification model? Appropriate metrics have to be defined.

2. How do we obtain reliable estimates of these errors? The methods for performance evaluation, and the concepts of overfitting and underfitting, have to be considered.

3. How do we compare the relative performance among competing models? That is, if there are multiple prediction models or algorithms, how are these models compared? Which one among them will be chosen for deployment in business?
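To ground these questions, here is a minimal Python sketch of the usual workflow (assuming scikit-learn, with a synthetic dataset): hold out a test set of unseen records so the performance estimate is reliable, then compute metrics such as accuracy and the confusion matrix:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    X, y = make_classification(n_samples=500, random_state=0)
    # Holding out a test set guards against overly optimistic (overfit) estimates
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    print(accuracy_score(y_te, y_pred))    # overall proportion of correct predictions
    print(confusion_matrix(y_te, y_pred))  # correct/incorrect counts per class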

Learning Resources

Video on “Model Evaluation” by Dr. Eugene Rex Jalao.

Assignment 1

Your faculty-in-charge will give you your assignment on Classification Methodologies.

3.2. Regression

Regression is a data mining task of predicting the target’s value, i.e., a numerical variable (y), by building a model based on one or more predictors, which can be numerical and categorical variables.

There are many applications of regression, such as data description, parameter estimation, prediction/estimation, and control. In this regard, regression analysis, arguably the most widely used (and misused) statistical technique, is utilized. Multiple linear regression (MLR) is a method used to model the linear relationship between a target variable and more than one predictor variable.
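Here is a minimal MLR sketch in Python (assuming scikit-learn; the predictors and target values are made up):

    from sklearn.linear_model import LinearRegression

    X = [[1200, 2], [1500, 3], [850, 1], [2000, 4]]  # two predictors, e.g., floor area and rooms
    y = [2.1, 2.9, 1.4, 4.0]                         # numerical target, e.g., price in millions

    mlr = LinearRegression().fit(X, y)
    print(mlr.intercept_, mlr.coef_)  # fitted coefficients of the linear model
    print(mlr.predict([[1700, 3]]))   # numerical prediction for new predictor values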


Learning Resources

Video on “Regression” by Dr. Eugene Rex Jalao.

3.2.1. Regression Model Evaluation

In regression, we predict actual values that are numerical in nature. Considering this, how do we know whether the predictions are accurate or whether the regression model is valid? Listed below are questions to be considered in evaluating a regression model (a sketch of common fit metrics follows the list).

1. Is at least one of the predictors useful in predicting the response? If this is not the case, we cannot predict Y in the first place because none of the predictors are useful.

2. How well does the model fit the data? Is it a good fit?

3. Given a set of predictor values, what is the predicted response value?

4. Are there any outliers that might influence the coefficients?

5. Do all of the predictors help explain Y, or is only a subset of the predictors useful?
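As a partial answer to question 2, two commonly reported fit metrics are R² (the proportion of variance explained) and the root mean squared error (RMSE). The Python sketch below assumes scikit-learn and made-up data:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score, mean_squared_error

    X = [[1], [2], [3], [4], [5]]
    y = [1.1, 1.9, 3.2, 3.9, 5.1]

    model = LinearRegression().fit(X, y)
    y_hat = model.predict(X)

    print(r2_score(y, y_hat))                   # closer to 1 means a better fit
    print(mean_squared_error(y, y_hat) ** 0.5)  # RMSE, in the same units as y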

Learning Resources

Video on “Regression Model Evaluation” by Dr. Eugene Rex Jalao.

3.2.2. Indicator Variables

Qualitative variables do not have a scale of measurement. In this regard, indicator variables, also known as dummy variables, are used to model qualitative variables in regression. Indicator variables assign levels to qualitative variables/categories so that regression analysis can be performed on them.
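Here is a minimal dummy-coding sketch in Python (assuming pandas; the region and sales columns are made up). For regression, one category is typically dropped and serves as the reference level:

    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south", "east", "south"],
                       "sales": [10, 14, 9, 15]})

    # k - 1 indicator variables; the dropped category ("east") is the reference level
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    design = pd.concat([df[["sales"]], dummies], axis=1)  # ready for regression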

Learning Resources

Video on “Indicator Variables” by Dr. Eugene Rex Jalao.


3.2.3. Multicollinearity

Multicollinearity is the inflation of coefficient estimates due to interdependent regressors. This implies that if the regressors (or the predictor variables) are dependent on each other (i.e., there is correlation among x1, x2, x3, ...), then the problem of multicollinearity exists in the final regression model. The said problem does not exist if all regressors are completely independent of each other; however, this is a rare occurrence in regression analysis. Usually, there is interdependence among predictor variables to a certain extent.

The effect of strong multicollinearity is that it can result in large variances and covariances for the least squares estimates of the coefficients. Large variances imply unstable predictions, and coefficient estimates would be very sensitive to minor changes in the regression model. Thus, the question now is: how is multicollinearity detected? We want procedures that correctly identify the presence of multicollinearity and provide insights as to which regressors are causing the problem.
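One common detection procedure is the variance inflation factor (VIF), which measures how much each coefficient’s variance is inflated by the other regressors; values far above a rule-of-thumb threshold of about 10 signal trouble. A minimal Python sketch, assuming statsmodels and made-up predictors:

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly dependent on x1
    x3 = rng.normal(size=100)                      # independent regressor
    X = np.column_stack([x1, x2, x3])

    for i in range(X.shape[1]):
        print(f"VIF x{i + 1}:", variance_inflation_factor(X, i))  # x1 and x2 blow up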

Learning Resources

Video on “Multicollinearity” by Dr. Eugene Rex Jalao.

3.2.4. Logistic Regression

Logistic regression predicts the probability of an outcome that can only have two values. As such, it can be considered a classification algorithm. Yes, it predicts a probability, but we can consider this example: anything below 0.5 is one class/category, and anything above 0.5 is another class/category. Furthermore, the prediction uses one or several predictors, which can be numerical and categorical in nature.
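A minimal Python sketch (assuming scikit-learn; the data are made up) showing the predicted probability and the 0.5 cutoff:

    from sklearn.linear_model import LogisticRegression

    X = [[2], [4], [6], [8], [10], [12]]
    y = [0, 0, 0, 1, 1, 1]  # an outcome with only two values

    logit = LogisticRegression().fit(X, y)
    p = logit.predict_proba([[7]])[0, 1]           # predicted probability of class 1
    print(p, "class 1" if p > 0.5 else "class 0")  # threshold the probability at 0.5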

Learning Resources

Video on “Logistic Regression” by Dr. Eugene Rex Jalao.

Assignment 2

Your faculty-in-charge will give you your assignment on Regression Methodologies.


MODULE 4: UNSUPERVISED LEARNING

Introduction

Module 4 talks about unsupervised learning where we find hidden patterns within the data. There is no response or class variable like in classification or regression. Moreover, in unsupervised learning, there is no guarantee that there are meaningful patterns.

Learning Objectives

After working on this module, you should be able to:

1. Identify and discuss appropriate unsupervised learning methodologies for various scenarios and business problems.

2. Build and validate prediction models by utilizing unsupervised learning methodologies.

3. Evaluate the performance of the prediction model.

4.1. Association Rule Mining

Association rule mining is a rule-based method for discovering relationships between variables. It is intended to identify strong rules using certain measures of interestingness. Specifically, given a set of transactions, we find rules that will predict the occurrence of an item based on the occurrences of other items in a particular transaction.

Figure 1. Example of Association Rules
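To make the rule measures concrete, here is a minimal pure-Python sketch of two standard measures of interestingness, support and confidence, computed for a made-up candidate rule {bread} -> {milk}:

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]

    antecedent, consequent = {"bread"}, {"milk"}
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)

    support = both / len(transactions)  # fraction of transactions containing the whole rule
    confidence = both / ante            # how often the rule holds when the antecedent occurs
    print(support, confidence)          # 0.6 and 0.75 for this toy data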


Learning Resources

Video on “Association Rule Mining” by Dr. Eugene Rex Jalao.

4.2. Sequential Pattern Mining

Sequential pattern mining is concerned with finding statistically relevant patterns within time series data where values are delivered in a sequence. A sequence is an ordered list of elements or transactions, whereas an element contains a collection of events or items. Each element is attributed to a specific time or location of a particular transaction. Sequential pattern mining is performed by growing subsequences or patterns one at a time.

Figure 2. Example of sequence data.

Learning Resources

Video on “Sequential Pattern Mining” by Dr. Eugene Rex Jalao.

4.3. Clustering

Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those belonging to other clusters.

4.3.1. K-Means Clustering

K-means clustering is a basic partitional clustering approach or algorithm that classifies or groups objects based on attributes into k groups (clusters). The number of clusters (k) must be specified first. Each cluster is associated with a center point called the centroid, and each point in the data set is assigned to the cluster with the closest centroid.


Figure 3. Overview of K-Means.
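A minimal Python sketch (assuming scikit-learn; the 2-D points are made up): specify k, fit, and read off the centroids and cluster assignments:

    from sklearn.cluster import KMeans

    points = [[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.5], [8.5, 9.5]]

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.cluster_centers_)  # the centroid of each cluster
    print(km.labels_)           # each point assigned to its closest centroid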

4.3.2. Hierarchical Clustering

Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree, which is usually visualized as a dendrogram. A dendrogram is a tree-like diagram that records the sequences of merges or splits in the data set. One of the advantages of the said clustering method is that there is no need to assume any particular number of clusters, unlike in K-means. Any desired number of clusters can be obtained by cutting the dendrogram at an appropriate level.

Figure 4. Sample dendrogram.
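A minimal Python sketch (assuming SciPy; the points are made up): build the hierarchy of merges, then cut the tree to obtain any desired number of clusters:

    from scipy.cluster.hierarchy import linkage, fcluster

    points = [[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.5], [8.5, 9.5]]

    Z = linkage(points, method="average")            # records the sequence of merges
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(labels)                                    # no k was needed until the cut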


Learning Resources

Video on “Hierarchical Clustering” by Dr. Eugene Rex Jalao.

4.4. Text Mining

Text mining, also known as text data mining or knowledge discovery in textual databases, is a semi-automated process of extracting knowledge from unstructured data sources. The benefits of text mining are obvious in text-rich data environments such as law, academic research, medicine, biology, technology, finance, and marketing. It can also be utilized on electronic communication records; examples include spam filtering, email prioritization and categorization, and automatic response generation.

Basically, text mining consists of these steps (a sketch of the first two follows the list):

1. Establish the corpus.

2. Create the term-document matrix.

3. Extract knowledge from the term-document matrix.
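A minimal Python sketch of the first two steps (assuming a recent version of scikit-learn; the two-document corpus is made up):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [  # Step 1: establish the corpus
        "data mining extracts patterns",
        "text mining extracts knowledge from text",
    ]

    vec = CountVectorizer()
    tdm = vec.fit_transform(corpus)  # Step 2: the term-document matrix (documents x terms)
    print(vec.get_feature_names_out())
    print(tdm.toarray())             # term counts per document, ready for Step 3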

Learning Resources

Video on “Text Mining” by Dr. Eugene Rex Jalao.

4.5. Social Media Sentiment Analysis

Social media sentiment analysis deals with two main types of textual information: facts and opinions. We note that factual statements can imply opinions too. Most current text information processing methods (e.g., text mining) work with factual information. In essence, such text processing can be extended into sentiment analysis or opinion mining, defined as the computational study of opinions, sentiments, and emotions expressed in text.

Opinions are important because whenever a decision has to be made, we want to hear the opinions of others. In the past, we asked for opinions from friends, family, focus groups, and consultants. Now, due to the advent of the Internet, opinions are available on a global scale.
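At its simplest, opinion mining can be sketched as counting opinion words against a sentiment lexicon. The Python sketch below is deliberately naive, and its word lists are made up for illustration:

    POSITIVE = {"good", "great", "love", "excellent"}
    NEGATIVE = {"bad", "poor", "hate", "terrible"}

    def polarity(text):
        """Classify a post by counting lexicon hits (a deliberately simple sketch)."""
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(polarity("I love this phone and its great battery"))  # -> positive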


Learning Resources

Video on “Social Media Sentiment Analysis” by Dr. Eugene Rex Jalao.

Assignment 3

Your faculty-in-charge will give you your assignment on Unsupervised Learning Methodologies.


FINAL EXAM

IDENTIFICATION / FILL IN THE BLANKS

Answer - Item:

1. Predictive Analytics / Data Mining - Exploration and analysis of a large quantity of data to discover meaningful patterns by automatic or semi-automatic means.

2. Classification - Historical data are used to build a model, and the goal is to predict previously unseen records.

3. ZeroR - The simplest classification methodology, which relies on the target and ignores all predictors.

4. Naïve Bayes - A frequency-based classifier that uses a probabilistic framework and conditional probability to classify data.

5. Nearest Neighbours - An intuitive method that classifies unlabeled data based on their similarity with sample examples in the training set.

6. Regression - A data mining task of predicting the target’s value by building a model based on one or more predictors, which can be numerical and categorical variables.

7. Indicator Variables - These are used to model qualitative variables in regression; they assign levels to such qualitative variables so that regression analysis can be performed on them.

8. Logistic Regression - Considered a classification algorithm; it predicts the probability of an outcome that can only have two values.

9. Association Rule Mining - A rule-based method that is intended to identify strong rules using certain measures of interestingness.

10. K-Means Clustering - A basic partitional approach or algorithm that classifies or groups objects based on attributes into k groups.

11. Dendrogram - A tree-like diagram that records the sequences of merges or splits in the data set.

12. Text Mining - A semi-automated process of extracting knowledge from unstructured data sources.

SOMETHING TO PONDER / ESSAY

In no more than 4 sentences, explain your answers to the following items:

1. Justify the necessity of utilizing the CRISP-DM Framework in predictive analytics.

2. Differentiate a decision tree from a random forest.

3. Differentiate supervised learning from unsupervised learning.

4. What is the difference among classification, regression, and clustering?

5. What is the significance of data preprocessing?