
Fundamentals of Descriptive Analytics
A Business Analytics Course

University of the Philippines Open University

Course Writers:

Dr. Melinda Lumanta
Ms. Louise Villanueva
Dr. Eugene Rex Jalao
Ms. Marie Karen Enrile
Asst. Prof. Joyce Manalo


    University of the Philippines

    OPEN UNIVERSITY

    COMMISSION ON HIGHER EDUCATION


Course Package

This learning package consists of:

1. Course Guide

2. Study Guides

3. Video lectures (available at UPOU Networks and on the attached USB)

4. Assessments


    UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY

    Fundamentals of Descriptive Analytics

    A Business Analytics Course

This course aims to introduce students to the fundamentals of descriptive analytics. Descriptive analytics makes use of current transaction data to enable managers to visualize how the company is performing. This course will teach students how to prepare reports using descriptive analytics tools.

    Prerequisite: Fundamentals of Data Warehousing

    COURSE OBJECTIVES

At the end of the course, the students should be able to:

1. Explain the concepts in descriptive statistics;

2. Contextualize descriptive statistics concepts and analytical techniques in business decision-making;

3. Explain the importance of data pre-processing;

4. Apply data pre-processing techniques in business;

5. Explain the importance of data visualization and communication;

6. Apply data visualization techniques to communicate the results of descriptive analytics to stakeholders; and

7. Develop an awareness of ethical norms as required under policies and applicable laws governing confidentiality and non-disclosure of data/information/documents, and proper conduct in the learning process and application of business analytics.


    COURSE OUTLINE

    UNIT I. Introduction to Descriptive Analytics

    MODULE 1. Statistics in Business

    A. Data and data sets

    B. What is statistics?

    C. Bases in choosing what statistics to use

    D. Application of statistics in business

    MODULE 2. Basic Descriptive Statistics

    A. Frequency distributions

    B. Measures of location

    C. Measures of dispersion

    D. Measures of association

    E. Measures of shape and other statistics

    MODULE 3. Sampling and Data Collection

    A. Types of sampling

    B. Central limit theorem

UNIT II. Data Pre-processing

MODULE 1. Basic Concepts in Data Pre-processing

A. What is data pre-processing?

B. Tasks for data pre-processing

MODULE 2. Methods for Data Pre-processing

A. Data Integration

B. Data Transformation

C. Data Cleaning

D. Data Reduction

MODULE 3. Post-Processing and Visualization of Data inside the Data Warehouse


UNIT III. Data Visualization and Communication

UNIT IV. Ethics

    COURSE MATERIALS

1. Course guide

2. Study guides per module

3. Video lectures

4. Additional reading materials in digital form

    STUDY SCHEDULE

Week 1: Course Overview
1. Read the course guide.
2. Participate in Discussion Forum 1: introduce yourself and write a brief reflection paper about the importance of big data in businesses today.

Weeks 2-4: Unit I - Introduction to Descriptive Analytics
1. Go through Modules 1 to 3.
2. Participate in Discussion Forums 2 to 4.
3. Watch the videos on Basic Descriptive Statistics and Sampling and Data Collection by Dr. Lisa S. Bersales.
4. Submit the required assignment as specified in the study guide.

Weeks 5-9: Unit II - Data Pre-processing
1. Go through Modules 5 to 6.
2. Watch the videos on Data Processing by Dr. Eugene Rex L. Jalao.
3. Participate in Discussion Forum 5.
4. Submit the required assignment as specified in the study guide.

Weeks 10-14: Unit III - Data Visualization and Communication
1. Go through Module 7.
2. Watch the video on Data Visualization and Communication.
3. Participate in Discussion Forum 6.

Week 15: Unit IV - Ethics
1. Go through Unit IV.
2. Watch the video on Ethics by Atty. Emerson Banes and Mr. Dominic Ligot.
3. Write a reflection paper on ethics in descriptive analytics in business.

Week 16: Course Evaluation
1. Write a self-reflection on how the course contributed to your understanding of descriptive analytics in business.

    COURSE REQUIREMENTS

    For you to pass the course, you will be evaluated on the following required activities:

Unit | Weight
1 | 20%
2 | 35%
3 | 35%
4 | 10%


Online Discussions

There will be a series of online discussions and activities for this course. In addition to gauging your understanding of the course topics, the online discussions give everybody an opportunity to apply the concepts discussed in the modules to specific situations.

As we progress through the course, we will be posting discussion topics and specific questions/instructions, so make it a point to visit the course site regularly.

Remember the following when participating in online discussions:

• All discussions will take place in the course site. A separate discussion forum will be created for each topic.

• If you wish to acquire the Certificate of Completion, contribute to the discussions by answering the discussion question and/or reacting to each discussion topic. Passing remarks like "I agree" are not considered substantial.

• Do not post lengthy contributions. Be clear on what your main point is and express it as concisely as possible.

• The forums will remain open throughout the course's duration.

• Be guided by netiquette rules (see http://www.albion.com/netiquette/corerules.html) when participating in online discussions. Respond to other postings courteously. Personal messages should be emailed directly to the person concerned.

• If you would like to use printed or online reference materials in your posting, don't forget to cite them accordingly (e.g., According to Hernandez (2010), this concept is...).

Assignments

The assignment is intended to help you integrate and apply your learning. Specific instructions will be posted in the course site.

If you wish to get the Certificate of Completion for this course, you must submit the assignment and get a passing mark. Online submission of assignments will be in the Assignment Bin.


GENERAL GUIDELINES

Please comply with the following house rules:

• You are always expected to uphold academic integrity and intellectual honesty as a learner. Cheating or plagiarism is not allowed.

• Observe deadlines. Follow the schedule of course activities, submit your assignments on time, and never ask for an exemption from a required task. Read in advance, try to anticipate possible conflicts between your personal schedule and the course schedule, and make the necessary adjustments to your study schedule.

• Limit the comments and materials you post to those that are relevant to the course topics. For your profile photo, do not post an informal photo or one that would be more appropriate for a personal website. Maintain a professional demeanour in all courses.


    UNIT I: INTRODUCTION TO DESCRIPTIVE ANALYTICS

    This unit intends to:

1. Introduce basic statistical concepts;

2. Introduce basic descriptive statistics; and

3. Introduce sampling and data collection.

    MODULE 1: STATISTICS IN BUSINESS

    1.1. Data and Data Sets

This part of the module is intended to familiarize the students with data and data sets, which serve as the foundations of statistics and analytics.

Learning Objectives

At the end of this part of the module, the students must be able to do the following:

1. Differentiate the types of data and levels of measurement

2. Differentiate the types of data sets and understand where each is best used

3. Differentiate the two branches of statistics: a. descriptive statistics and b. inferential statistics

4. Determine the appropriate use of statistics in business analytics.

Key Concepts

Attributes and Variables

Anyone who wants to embark on analytics must first start with data. Data are composed of objects and their attributes. For example, suppose the new human resources manager in a pharmaceutical company wants to know the profile of the company's sales representatives and is presented with the table below.


Sales Representatives' Performance for 1st Quarter 2018

Sales Representative | Age | Sex | City | Standardized Product Expertise Test Score | Total Amount of Quarterly Sales | Rank Based on Quarterly Sales
Abad, Maria | 23 | Female | Caloocan | 96 | 3,500,430.40 | 13
Basilio, Anna | 27 | Female | Las Piñas | 91 | 3,850,875.30 | 12
Cruz, Juan | 28 | Male | Makati | 92 | 5,290,320.50 | 2
Delos Santos, Jose | 23 | Male | Malabon | 94 | 3,216,739.95 | 14
Encarnacion, Leonora | 24 | Female | Mandaluyong | 94 | 4,589,850.00 | 6
Fajardo, Mario | 30 | Male | Manila | 95 | 4,670,902.25 | 5
Guzman, Emilio | 22 | Male | Marikina | 97 | 3,993,741.50 | 9
Herminio, Adela | 35 | Female | Muntinlupa | 96 | 3,890,004.70 | 10
Ilagan, Bienvenido | 28 | Male | Navotas | 93 | 2,863,045.25 | 15
Jacob, Rosa | 29 | Female | Parañaque | 92 | 4,097,589.75 | 8
Kalaw, Clarissa | 22 | Female | Pasay | 98 | 3,856,910.15 | 11
Lagman, Francisco | 28 | Male | Pasig | 94 | 4,970,438.25 | 4
Montero, Antonio | 31 | Male | Pateros | 92 | 1,368,495.45 | 16
Nuñez, Isabel | 25 | Female | Quezon City | 97 | 5,400,369.90 | 1
Ortiz, Katrina | 27 | Female | San Juan | 90 | 4,283,907.72 | 7
Pantaleon, Roel | 34 | Male | Taguig | 96 | 5,278,900.80 | 3

In this table, the objects are the sales representatives, and the attributes are the characteristics that these representatives have: age, sex, area or city of assignment, test scores, and their quarterly performance in the form of sales.


Variables and Levels of Measurement

In statistics, attributes that are organized for further data processing are called variables. There are different types of variables, and each type has properties that determine how it can be subjected to analysis.

Nominal variables are characterized by distinctness. They are labels and are non-numerical. In the table, the sales representatives' sex and city of assignment are nominal variables: males are distinct from females, and no intrinsic ordering can be observed among the cities of assignment.

Another type of variable is the ordinal variable. These are variables that pertain to order. The ranks given to the sales representatives based on their quarterly sales are considered ordinal. From the data on rank, we can say that Juan Cruz, who ranked 2nd, sold more than Roel Pantaleon, who ranked 3rd, but less than Isabel Nuñez, who ranked 1st. Looking at the ranks alone, however, gives no idea of the difference between the quarterly sales of these representatives.

The last three variables in the table are the product expertise test scores, age, and quarterly sales. These can be classified into what are referred to together as interval-ratio variables; the two are often taken together since both are quantitative in nature. The standardized product expertise test scores are considered interval variables. This is because standardized test scores, like IQ scores, often have an arbitrary origin (one does not necessarily start at zero) while having a fixed distance between scores. For example, the distance between 91 and 92 is equal to the distance between 95 and 96.

Meanwhile, variables such as age and quarterly sales are considered ratio variables. These variables have the characteristics of nominal, ordinal, and interval variables. However, unlike interval variables, ratio variables have an absolute zero origin. At some point, a sales representative does start with zero sales for the quarter. Age as a variable is also characterized by a meaningful zero point, which is at someone's birth.

Quantitative variables such as interval and ratio variables may be discrete or continuous. Discrete variables take the form of integers, while continuous variables take the form of real numbers.

Nominal, ordinal, interval, and ratio variables are also referred to as levels of measurement. The numerical properties of interval and ratio variables permit their use in higher statistical tests, while nominal and ordinal variables are often used for descriptive purposes.
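To make the distinction concrete, here is a minimal Python sketch (the variable names and the mapping are illustrative, not part of the course materials) showing how the level of measurement determines which summary statistics are meaningful:

```python
# Hypothetical sketch (names invented): map each variable from the sales
# representatives table to its level of measurement, then use that level
# to decide which summary statistics are meaningful.
levels = {
    "sex": "nominal",
    "city": "nominal",
    "rank": "ordinal",
    "test_score": "interval",
    "age": "ratio",
    "quarterly_sales": "ratio",
}

def allowed_summaries(level):
    """Summaries that make sense at a given level of measurement."""
    if level == "nominal":
        return {"mode"}                    # labels: only counting applies
    if level == "ordinal":
        return {"mode", "median"}          # ordering adds the median
    return {"mode", "median", "mean"}      # interval/ratio support arithmetic

for variable, level in levels.items():
    print(variable, "->", sorted(allowed_summaries(level)))
```

A check like this is a useful habit before analysis: it prevents, for instance, computing a "mean city" by accident.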


Study Question

Think of an interesting phenomenon that you want to study in your organization. List all of the possible variables and categorize them according to type.

Data Sets

While aggregated data are important, sets of data have proved to be more useful to organizations, since they permit organizations to analyze and interpret scenarios effectively and efficiently. There are three main types of data sets: record, graph, and ordered data sets.

Record data set

Record data sets are those that are structured and presented in rows. Record data sets may come as texts, numbers, or sequences. The table about quarterly sales above is considered a collection of record data.


Another kind of record data is an m × n data matrix, where m represents the rows (the objects) and n the columns (the numerical attributes). This is a matrix composed of real numbers.

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1

Aside from the m × n data matrix, record data can also come in the form of a term-by-document data set. This serves as a means to count how many times each term appears in a document.

Document | Area | Sales | Quota
Document 1 | 1 | 4 | 3
Document 2 | 2 | 5 | 3
Document 3 | 1 | 9 | 4

A special kind of record data is composed of combinations of items or services that are often bought or lumped together. This is called the market basket or transaction data set.

Transaction ID | Items
1 | Coffee, Pancakes
2 | Coffee, Pancakes, Hash Brown
3 | Pineapple Juice, Pancakes, Hash Brown
4 | Pineapple Juice, Rice, Egg
5 | Pineapple Juice, Rice, Egg, Beef Steak
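The transaction data set above can be represented directly in code. The following Python sketch (a hypothetical illustration, not part of the course materials) stores each transaction as a set of items and counts how often pairs of items are bought together, which is the kind of question market basket analysis asks:

```python
from collections import Counter
from itertools import combinations

# The transaction data set above, with each transaction stored as the
# set of items bought together.
transactions = [
    {"Coffee", "Pancakes"},
    {"Coffee", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Pancakes", "Hash Brown"},
    {"Pineapple Juice", "Rice", "Egg"},
    {"Pineapple Juice", "Rice", "Egg", "Beef Steak"},
]

# Count how often each pair of items appears in the same transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("Coffee", "Pancakes")])   # bought together in 2 transactions
```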

Graph data set

Graph data sets are those that represent relationships through the interconnections of points. This can commonly be observed in sociograms and in matrices that show the interaction between and among individuals in networks.

Ordered data set

Ordered data sets are those that show data over certain sequences, periods, or progressions. One of the most common ordered data sets is the time series, which records a certain variable over a period of time.


Sales per quarter for the last 5 years (in millions)

Year | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4
2011 | 25 | 21 | 24.3 | 29.3
2012 | 25.4 | 19.7 | 25.6 | 27.9
2013 | 23 | 20.1 | 26.2 | 28.9
2014 | 26.7 | 18.3 | 24.5 | 29
2015 | 25.2 | 20.9 | 26 | 28.2
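As a small illustration of how an ordered data set supports analysis over time, the Python sketch below loads the quarterly sales table above and computes the average sales per quarter across the five years (the data structure itself is an assumption made for illustration):

```python
# The time series above as a Python structure (an assumption for
# illustration): year -> sales per quarter, in millions.
sales = {
    2011: [25, 21, 24.3, 29.3],
    2012: [25.4, 19.7, 25.6, 27.9],
    2013: [23, 20.1, 26.2, 28.9],
    2014: [26.7, 18.3, 24.5, 29],
    2015: [25.2, 20.9, 26, 28.2],
}

# Average sales per quarter across the five years.
n_years = len(sales)
quarter_means = [
    round(sum(year[q] for year in sales.values()) / n_years, 2)
    for q in range(4)
]
print(quarter_means)
```

A summary like this immediately exposes seasonality: in this data, Quarter 2 is consistently the weakest quarter and Quarter 4 the strongest.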

1.2. Introduction to Statistics

This part of the module is intended to familiarize the students with descriptive and inferential statistics. It will serve as the backbone for the forthcoming modules that tackle statistical measures and tests that can be applied to describe, predict, and infer information based on the available data.

Key Concepts

Statistics as a field of study and in business

[Figure: word cloud of basic statistical concepts]


Statistics as a field can be traced to Europe in the 1500s, when statesmen and scholars from Great Britain, France, and Sweden were urged to make sense of data gathered from censuses (Stephenson, 2000). In 1662, John Graunt produced the first demographic report on mortality, based on the weekly mortality reports in London (Encyclopaedia Britannica, 2012).

Many statistical reports on demographics emerged as the field progressed from description to inference. This can largely be attributed to advances in mathematics, particularly in probability theory. Given this history, statistics to this day is inextricably linked to the field of mathematics, while others consider statistics a branch of science. In this regard, statistics can be deemed a meta-science or meta-language that aims to collect, analyze, summarize, and interpret data (Stephenson, 2000).

While the use of statistics can be traced to the affairs of nation states, it has proven to be an integral part of knowledge creation in both the natural and the social sciences. In fact, statistics as a field was long viewed as so complex that it was deemed accessible only to those who chose it as their field of expertise. This led many to overlook its usefulness until Sir Geoffrey Heyworth, addressing the Royal Statistical Society (1950), made a case for the use of statistics in business. According to Heyworth, business statistics must be simple enough for a businessman to comprehend. It must also guide action, but it should not serve as a substitute for a businessman's judgment. Heyworth further recognized the application of statistics in various facets of a business. However, he also warned that business statistics should be guided by a businessman's knowledge and experience, because figures only make sense when coupled with an understanding of context. While there was no hard and fast rule as to where statistics should be used, Heyworth considered it a never-ending process of making an idea more accurate and acceptable, as it often served as a bridge between initial and informed business judgments.

Heyworth's assertions in 1950 are still relevant to this day. Businesses employ statisticians who guide them in deciding how to optimize their processes, target their consumers, and create more buzz around their products and services, among other things. These actions, guided by business statistics and by the businessman's knowledge and experience, are taken to minimize costs, maximize profits, and give the business a competitive advantage.

You have probably heard about television show ratings as publicized by rival networks. These ratings are examples of business statistics. Since television remains one of the most used media, advertisers of products and services rely on television show ratings to determine where they can place their advertisements to reach consumers. Ideally, the higher the exposure, the better.

The advent of new information and communication technology has also ushered in new ways of processing and using business statistics. You have probably become accustomed to the numbers of reactions, views, and shares of social media posts and uploads. These are aggregated to determine the reach of social media pages, and thus they also serve as measures and drivers of business. Individuals and groups have created a new industry out of their presence and activities on social media, and this industry is anchored in business statistics.

Population and Parameters

Statistics as a field is associated with the research process, which entails gathering data from concerned parties. One important concept in the field is the population. This pertains to all of the items or individuals that a researcher, or in this case a businessman, wants to study. Once data have been gathered completely, the characteristics of these items or individuals are called parameters. Let's say that a businessman sells his product via an online platform and is interested in determining customer satisfaction. Based on the analytics provided by the online platform, a total of 10,000 customers bought the product during the year. The businessman decides to conduct a survey of all 10,000 customers to get their demographics and an accurate product satisfaction rating. The 10,000 customers are considered the population, and their demographic details and product satisfaction ratings are the parameters.

Sample and Statistic

Let's say that the businessman consults the other members of his team regarding his plan to survey the population of 10,000 customers. The team expresses concern about the resources that would be needed to reach all 10,000 customers. With their knowledge and experience in business statistics, the businessman and his team decide to select only 2,000 of the 10,000 customers for the year. These 2,000 customers are considered the sample, and their demographic details and product satisfaction ratings are the statistic.

While using the population to understand one's business is more accurate, using a sample is undeniably more effective and efficient.


[Figure: an illustration of the concepts of population and parameters, and of sample and statistic]

The two branches of statistics

Once the data are gathered from either the population or the sample, the analytic part of business statistics comes in. Businessmen may opt to subject the data to descriptive statistics, the branch of statistics that deals with procedures used to describe and summarize the important characteristics of a sample or population (Mendenhall, Beaver & Beaver, 2006). Let's say that the businessman and his team managed to survey all 10,000 consumers of the product for the year. Simple counting would inform the businessman that 70%, or 7,000, of his consumers are men aged 30 to 40.

Since you have also been introduced to the concepts of sample and statistic, you should know that these are crucial to the other branch of statistics. This branch is called inferential statistics because it allows one to draw conclusions, make predictions, and decide about the population based on the data gathered from the sample (Mendenhall, Beaver & Beaver, 2006). Let's say that the businessman and his team forwent the survey of the population and pushed through with a randomly selected sample of 2,000 customers. The sample showed that 40%, or 800, of the consumers are happy about the product and are considering buying an improved version in the following year. Guided by this value, the businessman and his team may expect an estimated 4,000 of their existing consumers to buy the improved version of the product.

These simple examples show the functions of the two branches of statistics and their application in the context of business.
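The arithmetic behind that estimate is straightforward. A minimal Python sketch (assuming, as the text does later, that the sample is an unbiased random sample):

```python
# A minimal sketch of the estimate above: scale the sample proportion up
# to the population. This assumes the 2,000 customers were selected as an
# unbiased random sample of the 10,000.
population_size = 10_000
sample_size = 2_000
happy_in_sample = 800              # the 40% who are happy with the product

proportion = happy_in_sample / sample_size      # 0.4
estimated_buyers = round(proportion * population_size)
print(estimated_buyers)            # point estimate of buyers in the population
```

Note that this is only a point estimate; inferential statistics also attaches a margin of error to such figures.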

Descriptive Statistics

Purpose

Your knowledge of populations and samples is useful for understanding descriptive statistics. As the name implies, descriptive statistics helps you describe and summarize the parameters of the population or the statistics of the sample. This may come in different measures:

A. Measures of location. These measures tell you the position of values in the frequency distribution. The most common measures of location are the measures of central tendency: the mean, median, and mode. Let's say that the businessman and his team found out that the average age of their consumers is 27. This means that consumers are typically around 27 years old, with the others either younger or older.

B. Measures of spread. Measures of location are not enough to capture the variability among the data in the frequency distribution. This is the function of the measures of spread, which tell you how close together or far apart the values in the frequency distribution are. Some examples of measures of spread are the range, percentiles, variance, and standard deviation. Let's say, for example, that the businessman and his team found out through the survey that product satisfaction was affected by customer service. This information prompted them to conduct another study based on the call logs of customer service representatives. They found out that, on average, customer service representatives could address customer concerns three days after the inquiry. However, when the businessman and his team measured the standard deviation, it yielded a value of 2. This indicates that some customer service representatives managed to address inquiries in as little as one day, while others took as long as five days. How could this happen? The businessman and his team should tackle the inconsistency to pave the way for more cost-efficient customer service.
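Here is a small Python sketch of the customer service scenario (the response times are invented so that the summary matches the example: a mean of 3 days and a standard deviation of 2):

```python
import statistics

# Hypothetical response times in days, invented so that the summary
# matches the scenario: a mean of 3 days and a standard deviation of 2.
response_days = [1, 1, 5, 5]

mean_days = statistics.mean(response_days)
sd_days = statistics.pstdev(response_days)   # population standard deviation

print(mean_days, sd_days)
```

The same mean can hide very different spreads, which is exactly why a measure of location should be reported together with a measure of spread.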


Assumptions

In contrast with inferential statistics, descriptive statistics only requires data that can be subjected to acceptable mathematical operations (Garbin, n.d.).

Inferential Statistics

Purpose

Descriptive statistics is useful for providing a summary of the data gathered from either the population or the sample. However, statisticians have recognized that, due to limited resources, gathering data from the population is less common than gathering data from samples. In such cases, inferential statistics can be used. In contrast to descriptive statistics, inferential statistics is used to estimate parameters and test hypotheses using data from samples, which allows results to be generalized to the population.

Assumptions

Unlike descriptive statistics, inferential statistics has stricter prerequisites before it can be applied to data. Aside from requiring data that can be subjected to acceptable mathematical operations, inferential statistics also requires unbiased estimation, since only the samples are used to infer the parameters of the population (Garbin, n.d.). This assumption entails the use of random sampling, a process of selection in which every case in the population has an equal chance of being selected for the sample (Healey, 2009).


References

Textbook

OpenStax. (2016, September 28). Introduction to Statistics. OpenStax CNX. https://cnx.org/contents/30189442-6998-4686-ac05-ed152b91b9de

Healey, J.F. (2009). Statistics: A tool for social research (8th ed.). USA: Wadsworth, Cengage Learning.

Videos

Friedman, L.W. (2016). Introduction to business statistics. Retrieved from https://www.youtube.com/watch?v=poA0KntMgSM

Rigollet, P. (2017). 18.650 Fundamentals of Statistics, Fall 2017 (Massachusetts Institute of Technology: MIT OpenCourseWare). https://www.youtube.com/watch?v=VPZD_aij8H0&list=PLUl4u3cNGP60uVBMaoNERc6knT_MgPKS0 (Accessed December 21, 2016). License: Creative Commons BY-NC-SA.

Statistics Canada. (2013). Statistics: The invisible made visible. Retrieved from https://www.youtube.com/watch?v=_4GT5v0YaOE

SAS Software. (2013). How do you use statistics and how does it benefit your organization? Retrieved from https://www.youtube.com/watch?v=LJV-Mlv-7dM

Websites

Garbin, C. (n.d.). Statistics and statistical tests: Assumptions and conclusions. Retrieved from http://psych.unl.edu/psycrs/941/q4/assumptions_141.pdf

John Graunt. (n.d.). In Encyclopaedia Britannica. Retrieved from https://www.britannica.com/biography/John-Graunt

Stephenson, D. (2000). Brief history of statistics. Retrieved from http://folk.uib.no/ngbnk/kurs/notes/node4.html

Web Center for Social Research. (2006). Descriptive statistics. Retrieved from https://www.socialresearchmethods.net/kb/statdesc.php

Royal Statistical Society. (1950). The use of statistics in business. The Journal of the Royal Statistical Society, 113(1), 1-8. DOI: 10.2307/2980797


    MODULE 2: BASIC DESCRIPTIVE STATISTICS

Introduction

We will focus on summary statistics: the different measures used to describe any set of data. If we want to know the typical value of a certain variable, how different the values are from one another, or how a certain data point compares to the rest, we can use these measures.

    2.1. Frequency Distribution

    Frequency is simply the number of occurrences of an event. A frequency distribution is a

    list, table or graph that displays the frequency of various outcomes in a sample. It tells us

    how many there are of each item in the data set.

    Frequency distribution can show us the raw number of each item and its percentage

    toward the total.
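As a quick illustration, a frequency distribution can be computed with a few lines of Python (the T-shirt sales data below is hypothetical):

```python
from collections import Counter

# Hypothetical data set: T-shirt sizes sold in one day
sizes = ["M", "S", "M", "L", "M", "S", "L", "M", "S", "M"]

freq = Counter(sizes)                               # raw count of each item
total = sum(freq.values())
rel_freq = {k: v / total for k, v in freq.items()}  # percentage toward the total

print(freq["M"])      # 5 of the 10 sales were size M
print(rel_freq["M"])  # 0.5, i.e. 50% of the total
```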

    Learning Resources Read: This online resource explains it in a simple way and shows examples https://www.spss-tutorials.com/frequency-distribution-what-is-it/ This video illustrates the concept in a novel way https://www.youtube.com/watch?time_continue=145&v=dr1DynUzjq0

    Understanding Frequency Distribution gives us a way of understanding and organizing

    our data in a logical way. Once we have done this, we will be able to apply different

    summary statistics measures to our data. These Measures are explained in the following

    sections.



    2.2. Measures of Central Tendency

    Learning Resources Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion Video by Dr. Lisa Bersales [From 02:05] https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I

    Measures of Central Tendency give us the typical value of data. There are three measures

    of central tendency, the Mean, Median, and Mode.

    Mean

    The mean is the sum of all values of observations divided by the number of observations

    in the data set.

Mean = ( Σ Xi ) / N

    Where the Mean is the summation of all values of X (from X1 to XN ) divided by the total

    number of values (N). You can see an example of this in Dr. Bersales’s video.

    Median

The median is simply the middle value in the data set.

    Where N is the total number of Values, this is the formula of Median for odd numbers.

Median = ( (N + 1)/2 )th term

    This is the formula for even numbers

Median = [ (N/2)th term + (N/2 + 1)th term ] / 2



Note that these formulas do not return an actual value; instead, they return the position of the nth term.

    This means that you need to order the data (as we learned in frequency distribution) and

    count from the beginning until you reach the term mentioned.

Make sure to watch Dr. Bersales’s video to learn more about the median.

    Mode

    The mode is the value that occurs most often in the data set. There is no formula for the

    mode. Instead we can identify the mode by looking at the frequency distribution. There

    can be multiple modes. Dr. Bersales’s video discusses this further.
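The three measures can be checked with a short Python sketch using the built-in statistics module (the data set below is hypothetical):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]            # hypothetical data set, already ordered

mean = sum(data) / len(data)          # (Σ Xi) / N
median = statistics.median(data)      # even N: average of the two middle terms
mode = statistics.mode(data)          # the value that occurs most often

print(mean, median, mode)             # 5.0 4.0 3
```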

    Study Questions When is it best to use mean? What about median or mode? Name some specific examples of situations in which one would choose a certain measure over the two others.

    2.3. Measures of Location

Sometimes, we want to know how a certain data point compares with the rest. This is the case, for example, with rankings and quotas. In some situations, we could also divide data into a certain number of equal sections to answer our questions, as with problems that involve brackets, classes, and other groupings.

Measures of Location specify points in the data set below which a specified amount of the data lies. This allows us to find the position of a data point in relation to the entire data set.

    Some examples of these are percentiles, deciles and quartiles. Percentiles divide the data into 100 equal parts, deciles divide the data into 10 equal parts, and quartiles divide the data into 4 equal parts.

    Median, a measure of central tendency discussed earlier, is also a special measure of location. If you can recall, the median is the middle value in the data set so it divides the data into two equal parts.

    Dr. Bersales explains this further in her video.
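As a sketch, quartiles can be computed with Python’s statistics.quantiles function on a hypothetical data set (note that different software packages may use slightly different interpolation rules):

```python
import statistics

data = [15, 20, 35, 40, 50, 55, 60, 70]   # hypothetical, already ordered

# Quartiles: the three cut points that divide the data into 4 equal parts
q1, q2, q3 = statistics.quantiles(data, n=4)

# The second quartile is the median: half the data lies below it
print(q2)   # 45.0
```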


    2.4. Measures of Dispersion

    Learning Material Watch: Measures of Central Tendency, Measures of Location, Measures of Dispersion Video by Dr. Lisa Bersales [From 21:19] https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I

There are two types of Measures of Dispersion: absolute dispersion, which measures the

variability within a single data set, and relative dispersion, which compares the variability

of one data set with that of others.

    Variance and Standard Deviation are measures of dispersion with reference to the mean.

    The higher these values are, the farther away from the mean the data values are. Standard

deviation is the square root of the variance, resulting in a number that is always non-negative and is in

    the same units as the mean.
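A small Python check of these two measures, on a hypothetical data set:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]   # hypothetical data set

var = statistics.pvariance(data)   # mean squared deviation from the mean
sd = statistics.pstdev(data)       # square root of the variance

print(var, round(sd, 2))           # 5.76 2.4
```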



    MODULE 3: SAMPLING AND DATA COLLECTION

    Introduction

    Guided by the knowledge in statistics, students must also become accustomed to the

    process of sampling. This module is intended to familiarize students with the different

    types of sampling and the theory that guides the process.

    Learning Objectives

At the end of the module, the students must be able to:

    a. Differentiate the types of sampling; b. Understand the theory behind sampling; and c. Use sampling in the business context.

    3.1. History of Sampling

Sampling is defined as “a process or method of drawing a representative group of

    individuals or cases from a particular population” (Encyclopaedia Britannica, 2017). This

    process is associated with the fact that it is more effective and efficient to study samples

    taken from a population.

    Much like the history of statistics, the history of sampling has various roots. Bethlehem

    (2009) reiterated that sampling theory became a legitimate area of study in statistics

through the works of Anders Kiaer of the Norwegian Statistical Bureau. In his published

study in 1895, Kiaer presented his “Representative Method” of selecting samples based

on the population. The Representative Method received both praise and criticism from

scholars, which prompted Kiaer and other statisticians to refine and improve the

    method. The Representative Method’s lack of random selection was improved by Bowley

in 1906. The works of both Kiaer and Bowley led to the rise of probability and non-

    probability sampling.


    3.2. Probability Sampling

    The use of probability sampling is guided by the probability theory particularly the law of

large numbers and the central limit theorem. The assumption is that as the number of

samples selected from the population increases, the statistics obtained from these

samples get closer to the expected or actual values in the population, and their

distribution approaches the normal distribution.

    There are different techniques of probability sampling:

    • Simple Random Sampling - In simple random sampling, the researcher only

implements a selection procedure that ensures that every member of the population has

an equal chance of being selected.

    • Stratified Sampling - This is a probability sampling technique where a

heterogeneous group is first divided into homogeneous groups or strata from which the

samples will be selected. The number of samples selected per stratum

    corresponds to the percentage of the stratum to the entire population.

    • Cluster Sampling - This type of probability sampling technique is similar to stratified

    sampling. The only difference is that not all of the strata are selected. Instead, the

researcher will first select a number of strata from which the samples will be randomly

taken.

    • Systematic Sampling - In systematic sampling, selection of samples starts from a

    random point and will be carried over based on a fixed interval. Some researchers

    conduct systematic sampling by generating a random number that will serve as

    the starting point and the interval.

    • Multistage Sampling - This is the combination of the probability sampling

    techniques mentioned above.
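A minimal Python sketch of two of these techniques, on a hypothetical population of 100 member IDs (a fixed random seed is used so the example is repeatable):

```python
import random

random.seed(42)                     # fixed seed so the example is repeatable
population = list(range(1, 101))    # hypothetical population of 100 member IDs

# Simple random sampling: every member has an equal chance of selection
srs = random.sample(population, k=10)

# Systematic sampling: a random starting point, then a fixed interval
interval = len(population) // 10    # every 10th member
start = random.randrange(interval)
systematic = population[start::interval]

print(len(srs), len(systematic))    # 10 10
```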

    3.3. Non-Probability Sampling

    This type of sampling is only used when researchers are not concerned about generalizing

    the results of the study to the population. Instead, the researcher only aims to get data for

    specific cases.

    Some examples of non-probability sampling are as follows:

    • Quota Sampling - In quota sampling, the researcher only ensures that a number

    of samples will be selected from all the strata. For example, a businesswoman


    finds out that the customer-base of her cosmetic company is composed of

    Caucasian, Asian, Black, and Latina women aged 20-30. She sets to survey 10

    women from each race.

    • Purposive Sampling - Purposive sampling is the selection of participants on the

    premise that they met the criteria of the researcher. Snowball or chain sampling is

    an example of purposive sampling. For example, you want to compare and

    contrast the manufacturing practices of Japanese companies in the Philippines.

    Given the specificity of your purpose, your study does not entail random selection.

    Instead, you will be driven more by the criterion in the selection of your sample.

    • Convenience Sampling - Convenience sampling relies solely on availability. For

    example, a chef hands over a survey form to every customer who eats at his

    restaurant to determine the level of satisfaction on the products and services.

    Study Question When is it appropriate to use probability sampling or non-probability sampling?

    References

    “Measures of Central Tendency, Measures of Location, Measures of Dispersion” (Video) by Dr. Lisa Bersales https://www.youtube.com/watch?v=9TF8btw1aNU&index=31&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6

    Bethlehem, J. (2009). The rise of survey sampling. Retrieved from https://www.cbs.nl/-/media/imported/documents/2009/07/2009-15-x10-pub.pdf

    Parker, M. (2017). Types of sampling. Retrieved from https://www.ma.utexas.edu/users/parker/sampling/srs.htm

    Sampling. (2017). In Enclyclopaedia Britannica. Retrieved from https://www.britannica.com/science/sampling-statistics



    Assignment for Unit 1 - Introduction to Descriptive Analytics

Go to the Monthly National Government Cash Operations Report

    (https://data.gov.ph/?q=dataset/national-government-cash-operations-report) and

    download Data Sheets 2011 to 2014. Using the lessons learned in Unit 1, conduct the

    following:

    1. What are the common variables in the data sheets? Identify the level of

    measurement of each variable (5 points).

    2. Randomly select two data sheets from Data Sheets 2011 to 2014. Indicate the

    years of the two data sheets selected and the process of selection employed (5

    points).

    3. Randomly select six out of the twelve months that will be part of the record data

    set. Indicate the months selected and the process of selection employed (5

    points).

    4. Create a table that shows the sum of values of common variables for each of the

    selected years. Explain the type of data set generated (15 points).

    5. Compute for the mean and standard deviation of the data from the selected years.

    Write a description of the results (20 points).



    UNIT II: DATA PREPROCESSING

    This unit intends to:

    1. Introduce basic concepts in data preprocessing; and 2. Introduce methods of data preprocessing.

    MODULE 1: BASIC CONCEPTS IN DATA PREPROCESSING

Introduction

Data preprocessing is an important step in data analytics. It aims at assessing and

    improving the quality of data for secondary statistical analysis. With this, the data is better

    understood and the data analysis is performed more accurately and efficiently.

    Learning Objectives

    After studying this module, you should be able to:

    1. Explain what data preprocessing is and why it is important in data analytics; and

    2. Describe different forms of data preprocessing.

    1.1. What is Data Pre-processing?

    Data in the real world tend to be incomplete, noisy, and inconsistent. “Dirty” data can lead

    to errors in parameter estimation and incorrect analysis leading users to draw false

conclusions. Quality decisions must be based on quality data; hence, unclean data may

    cause incorrect or even misleading statistical results and predictive analysis. Data

    preprocessing is a data mining technique that involves transforming raw or source data

    into an understandable format for further processing.


1.2. Tasks for Data Pre-processing

Several distinct steps are involved in preprocessing data. Here are the general steps:

    1. Data cleaning

    • This step deals with missing data, noise, outliers, and duplicate or incorrect

    records while minimizing introduction of bias into the database.

    • Data is cleansed through processes such as filling in missing values,

    smoothing the noisy data, or resolving the inconsistencies in the data.

    2. Data integration

    • Extracted raw data can come from heterogeneous sources or be in

    separate datasets. This step reorganizes the various raw datasets into a

single dataset that contains all the information required for the desired

    statistical analyses.

    • Involves integration of multiple databases, data cubes, or files.

    • Data with different representations are put together and conflicts within the

    data are resolved.


    3. Data transformation

    • This step translates and/or scales variables stored in a variety of formats

    or units in the raw data into formats or units that are more useful for the

    statistical methods that the researcher wants to use.

    • Data is normalized, aggregated and generalized.

    4. Data reduction

    • After the dataset has been integrated and transformed, this step removes

    redundant records and variables, as well as reorganizes the data in an

    efficient and “tidy” manner for analysis.

    • Pertains to obtaining reduced representation in volume but produces the

    same or similar analytical results.

    • This step aims to present a reduced representation of the data in a data

    warehouse.

    Pre-processing is sometimes iterative and may involve repeating this series of steps until

    the data are satisfactorily organized for the purpose of statistical analysis. During

    preprocessing, one needs to take care not to accidentally introduce bias by modifying the

    dataset in ways that will impact the outcome of statistical analyses. Similarly, we must

    avoid reaching statistically significant results through “trial and error” analyses on

    differently pre-processed versions of a dataset.

    Learning Resource Watch Dr. Eugene Rex Jalao’s video on Data Preprocessing. https://www.youtube.com/watch?v=qk3gedLrpIU&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=20



    MODULE 2: METHODS OF DATA PREPROCESSING

    Introduction

Data preprocessing consists of a series of steps to transform data extracted from different

data sources into “clean” data prior to statistical analysis. Data pre-processing includes

    data cleaning, data integration, data transformation, and data reduction.

    Learning Objectives After studying this module, you should be able to:

    1. Understand the different methods of data preprocessing; and

    2. Differentiate the different techniques of data preprocessing.

    2.1. Data Integration

    Data integration is the process of combining data derived from various data sources (such

    as databases, flat files, etc.) into a consistent dataset. In data integration, data from the

    different sources, as well as the metadata - the data about this data - from different sources

    are integrated to come up with a single data store. There are a number of issues to

    consider during data integration related mostly to possible different standards among data

    sources. These issues could be entity identification problem, data value conflicts, and

    redundant data. Careful integration of the data from multiple sources may help reduce or

avoid redundancies and inconsistencies and improve the speed and quality of the

resulting data mining.

    Four Types of Data Integration Methodologies

    1. Inner Join - creates a new result table by combining column values of two

    tables (A and B) based upon the join-predicate.

    2. Left Join - returns all the values from an inner join plus all values in the left

    table that do not match to the right table, including rows with NULL (empty)

    values in the link column.

    3. Right Join - returns all the values from the right table and matched values

    from the left table (NULL in the case of no matching join predicate).

    4. Outer Join - the union of all the left join and right join values.
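The four join types can be sketched with plain Python dictionaries standing in for two tables (the customer names and balances below are hypothetical); `None` plays the role of NULL:

```python
# Table A: customer id -> name; Table B: customer id -> balance
left = {1: "Ana", 2: "Ben", 3: "Cara"}
right = {2: 500, 3: 700, 4: 900}

# Inner join: only ids present in BOTH tables
inner = {k: (left[k], right[k]) for k in left.keys() & right.keys()}

# Left join: every id from the left table; NULL (None) where B has no match
left_join = {k: (left[k], right.get(k)) for k in left}

# Right join: every id from the right table; NULL where A has no match
right_join = {k: (left.get(k), right[k]) for k in right}

# Outer join: the union of the left join and right join
outer = {k: (left.get(k), right.get(k)) for k in left.keys() | right.keys()}

print(sorted(outer))   # [1, 2, 3, 4]
```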


    Learning Resource Watch: Dr. Eugene Rex Jalao’s video on Data Integration https://www.youtube.com/watch?v=EpdIz2uH1aM&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=21

    2.2. Data Transformation

    Data transformation is a process of transforming data from one format to another. It aims

    to transform the data values into a format, scale or unit that is more suitable for analysis.

Data transformation is an important step in data preprocessing and a prerequisite for

building predictive analytics solutions.



    Here are a few common possible options for data transformation:

1. Normalization - a way to scale a specific variable to fall within a small, specified range

    a. min-max normalization - transforming values to a new scale such that all

values fall within a specified range, such as [0, 1].

    b. Z-score standardization - transforming a numerical variable to a standard

    normal distribution


    2. Encoding and Binning a. Binning - the process of transforming numerical variables into categorical

    counterparts. i. Equal-width (distance) partitioning

    Divides the range into N intervals of equal size, thus forming a

    uniform grid.

    ii. Equal-depth (frequency) partitioning

    Divides the range into N intervals, each containing

    approximately the same number of samples.

    b. Encoding - the process of transforming categorical values to binary or

numerical counterparts, e.g. mapping male or female for gender to 1 or 0. Data

    encoding is needed because some data mining methodologies, such as

    Linear Regression, require all data to be numerical.

    i. Binary Encoding (Unsupervised)

    Transformation of categorical variables by taking the values 0

    or 1 to indicate the absence or presence of each category.


    If the categorical variable has k categories, we would need to create k binary variables.

    ii. Class-based Encoding (Supervised)

    • Discrete Class

    Replace the categorical variable with just one new

    numerical variable and replace each category of the

    categorical variable with its corresponding probability of

    the class variable.


    • Continuous Class Replace the categorical variable with just one new numerical variable and replace each category of the categorical variable with its corresponding average of the class variable.
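The transformation options above can be sketched in a few lines of Python; the column values, the `regions` categories, and the `pep` class labels are all hypothetical:

```python
import statistics
from collections import defaultdict

data = [200, 300, 400, 600, 1000]     # hypothetical numeric column

# 1a. Min-max normalization into [0, 1]
lo, hi = min(data), max(data)
minmax = [(x - lo) / (hi - lo) for x in data]

# 1b. Z-score standardization: (x - mean) / standard deviation
mu, sd = statistics.fmean(data), statistics.pstdev(data)
z = [(x - mu) / sd for x in data]

# 2a. Binning into N categorical bins
N = 2
width = (hi - lo) / N                 # equal-width: same value range per bin
equal_width = [min(int((x - lo) // width), N - 1) for x in data]
depth = len(data) // N                # equal-depth: about the same count per bin
equal_depth = [min(i // depth, N - 1) for i, x in enumerate(sorted(data))]

# 2b-i. Binary (one-hot) encoding: k categories -> k 0/1 variables
regions = ["north", "south", "north", "east", "north", "south"]
cats = sorted(set(regions))
one_hot = [{c: int(r == c) for c in cats} for r in regions]

# 2b-ii. Class-based encoding (discrete class): replace each category with
# the probability of the class variable within that category
pep = [1, 1, 0, 0, 1, 1]              # hypothetical yes/no class labels
sums, counts = defaultdict(float), defaultdict(int)
for r, y in zip(regions, pep):
    sums[r] += y
    counts[r] += 1
encoded = [sums[r] / counts[r] for r in regions]
```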

    Learning Resources Watch: 1. Dr. Eugene Rex Jalao’s video on Data Transformation

    https://www.youtube.com/watch?v=ihHGKlAKL_s&index=18&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I

    2. Dr. Eugene Rex Jalao’s video on Data Encoding https://www.youtube.com/watch?v=wLqJ3HRtC_w&index=22&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I



    2.3. Data Cleaning

    All data sources potentially include errors and missing values – data cleaning addresses

    these anomalies. Data cleaning is the process of altering data in a given storage resource

to make sure that it is accurate and correct. Data cleaning routines attempt to fill in

    missing values, smooth out noise while identifying outliers, and correct inconsistencies in

    the data, as well as resolve redundancy caused by data integration.

    Data Cleaning Tasks:

    1. Fill in missing values

    Solutions for handling missing data:

    a. Ignore the tuple

    b. Fill in the missing value manually

    c. Data Imputation

    - Use a global constant to fill in the missing value

    - Use the attribute mean to fill in the missing value

    - Use the attribute mean for all samples belonging to the same class

    2. Cleaning noisy data

    Solutions for cleaning noisy data:

    a. Binning - transforming numerical values into categorical components

    b. Clustering - grouping data into corresponding cluster and use the cluster

    average to represent a value

    c. Regression - utilizing a simple regression line to estimate a very erratic

    data set

    d. Combined computer and human inspection - detecting suspicious values

    and checking it by human interventions

    3. Identifying outliers

    Solutions for identifying outliers:

    a. Box plot
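A short Python sketch of mean imputation and the box plot (1.5 × IQR) rule, on hypothetical data:

```python
import statistics

# 1. Fill in missing values: impute with the attribute mean (None = missing)
ages = [25, None, 31, 40, None, 28]
observed = [a for a in ages if a is not None]
mean_age = statistics.fmean(observed)
imputed = [a if a is not None else mean_age for a in ages]

# 3. Identify outliers with the box plot rule: flag values farther than
# 1.5 * IQR beyond the first or third quartile
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(mean_age, outliers)   # 31.0 [102]
```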

    Learning Resource Watch: Dr. Jalao’s video on Data Cleaning https://www.youtube.com/watch?v=qKC4oPpcbEg&index=23&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I



    2.4. Data Reduction and Manipulation

    Data reduction is a process of obtaining a reduced representation of the data set that is

much smaller in volume yet produces the same (or almost the same) analytical results.

    The need for data reduction emerged from the fact that some database/data warehouse

    may store terabytes of data, and complex data analysis/mining may take a very long time

    to run on the complete data set.

    Data Reduction Strategies:

1. Sampling - utilizing a smaller representative subset (a sample) of the big data set or

population that generalizes to the entire population.

    A. Types of Sampling

    i. Simple Random Sampling - there is an equal probability of selecting

    any particular item.

    ii. Sampling without replacement - as each item is selected, it is

    removed from the population

    iii. Sampling with replacement - objects are not removed from the

    population as they are selected for the sample

    iv. Stratified sampling - split the data into several partitions, then draw

    random samples from each partition.

    2. Feature Subset Selection - reduces the dimensionality of data by eliminating

    redundant and irrelevant features.

    A. Feature Subset Selection Techniques

    i. Brute-force approach - try all possible feature subsets as input to

    data mining algorithm

    ii. Embedded approaches - feature selection occurs naturally as part

    of the data mining algorithm

    iii. Filter approaches - features are selected before data mining

    algorithm is run

    iv. Wrapper approaches - use the data mining algorithm as a black

    box to find the best subset or attributes

    3. Feature Creation - creating new attributes that can capture the important

    information in a data set much more efficiently than the original attributes.

    A. Feature Creation Methodologies

    i. Feature Extraction

    ii. Mapping Data to New Space

    iii. Feature Construction
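Two of the strategies above can be sketched in Python (the record IDs and feature columns are hypothetical; a fixed random seed keeps the example repeatable):

```python
import random
import statistics

random.seed(7)                             # fixed seed for repeatability
population = list(range(1000))             # hypothetical record IDs

# 1. Sampling as data reduction
without_repl = random.sample(population, k=50)   # item removed once selected
with_repl = random.choices(population, k=50)     # items may repeat
strata = {"even": [x for x in population if x % 2 == 0],
          "odd": [x for x in population if x % 2 == 1]}
stratified = [x for part in strata.values()      # split, then draw per partition
              for x in random.sample(part, k=25)]

# 2. Feature subset selection, filter approach: drop features that never
# vary (zero variance) before any mining algorithm is run
features = {"income": [30, 45, 60, 80, 120],
            "constant": [1, 1, 1, 1, 1],
            "age": [25, 32, 41, 52, 60]}
selected = [name for name, col in features.items()
            if statistics.pvariance(col) > 0]

print(selected)   # ['income', 'age']
```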


    Learning Resource Watch:

    Dr. Jalao’s video on Data Reduction and Manipulation

https://www.youtube.com/watch?v=-JPopvvngsQ&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&index=19



    MODULE 3: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE

    Introduction

    Let us now learn how we can post-process and visualize the data inside the data

    warehouse.

    Learning Objectives

    After working on this module, you should be able to:

    1. Understand various techniques used for post-processing of discovered structures

    and visualization.

    3.1. Exercises using R

    First, what is R? R is an integrated suite of software facilities for data manipulation,

    calculation and graphical display.

    It has an effective data handling and storage facility. It also has a large, coherent,

    integrated collection of intermediate tools for data analysis. In addition, it has graphical

    facilities for data analysis and display either directly at the computer or on hard copy.

    Take note that R is not a database but connects to a DBMS. It is not a spreadsheet view

    of data, but it connects to Excel/MS Office.

    R is free and open source though it has a steep learning curve. RStudio IDE is a powerful

    and productive 3rd Party user interface for R. It’s free, open source, and works great on

    Windows, Mac, and Linux.

    Exercises for this session will include the following:

    1. Working with dataset Wage

    2. Studying, reducing and structuring the dataset

    3. Plotting the dataset

    4. Introducing a business analytics task for the dataset

    5. Working with another dataset


    In post-processing, we remember that data extracted from a data warehouse or pieces of

    knowledge extracted from an initial data mining task could be further processed. We can

    simplify the data, apply descriptive statistics, do visualizations or graphing tasks, or

apply further business analytics tools.

    Watch the "Data Post-processing" video by Raymond Lagria to understand preliminaries,

    data frames, reading data, subsetting, graphing and plotting, and regression analysis in

    R.

    Always take note to transform your dataset into your desired format before applying further

    data mining techniques.

    Study Question If you were a business manager, what types of visualizations for the data warehouse’s

    data would you like to see?

    3.2. Case Study

    Let us continue to see how post-processing and plotting is done with R in the “Data Post-

    processing” Video by Raymond Lagria.

https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s

    References

    https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/

    Data Post-Processing (Slides) by Raymond Lagria

    Data Post-Processing (Video) by Raymond Lagria

https://www.youtube.com/watch?v=0fgDbPhegg4&index=86&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I&t=0s



    Assignment for Unit 2 - Data Preprocessing

Open the bankdata.csv file. The Bank Dataset contains independent variables,

specifically age, region, income, sex, married, children, car, save_act, current_act, and

mortgage, and one response variable which answers the question “Did the customer buy

a PEP (Personal Equity Plan) after the last mailing?” with a yes/no response.

    Using the lessons learned in Unit 2, conduct the following:

    1. Normalize the income variable into a [0,1] scale. (10 points)

    2. Create an equal-depth (frequency) variable for Income where the new variable

    could take in “Low”, “Medium”, and “High” data. (15 points)

    3. With reference to the region and pep variables, create a new numerical variable

    (region_encoded) containing the numerical equivalent of each category of the

    region variable. Replace each category with its corresponding probability of the

    pep variable. (25 points)

    Other References Used for Unit II:

Malik, J. S., Goyal, P., & Sharma, A. K. (2010). A comprehensive approach towards data preprocessing techniques & association rules. IES-IPS Academy, Indore, India. Retrieved from https://bvicam.ac.in/news/INDIACom%202010%20Proceedings/papers/Group3/INDIACom10_279_Paper%20(2).pdf

Son, N. H. (2006). Data mining course: Data cleaning and data preprocessing. Warsaw University. Retrieved from http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf

Malley, B., Ramazzotti, D., & Wu, J. T. (2016). Data pre-processing. In Secondary analysis of electronic health records. Springer, Cham. Retrieved from https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2



UNIT III: DATA VISUALIZATION AND COMMUNICATION

Introduction

Objectives

    1. Define Visualization; 2. Give examples of charts; and 3. Describe what makes an effective Visualization.

Learning Resources 1. Watch: The Beauty of Data Visualization

    https://www.oercommons.org/courses/the-beauty-of-data-visualization/view

    2. Watch: Raymond Freth Lagria’s “Visualization” video

    https://www.youtube.com/watch?v=Yu1qoZ6Y9EU&index=88&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I

    3. Read: MIT Statistics and Visualization for Data Analysis lecture notes from OER Commons

    https://www.oercommons.org/courses/statistics-and-visualization-for-data-analysis-and-inference-january-iap-2009/view

    1.1. What is Visualization?

    Watch this OER video from TED-Ed to understand the importance of data visualization:

    https://www.oercommons.org/courses/the-beauty-of-data-visualization/view

    Visualization is the presentation of information using spatial or graphical representations.

    Its purposes include facilitating comparison, recognizing patterns, and supporting general

    decision-making.

    It makes use of the human visual sense to understand data sets. Seeing data visually

    allows humans to notice patterns, trends, and comparisons in a way that inspecting

    raw numbers does not.

    Some examples are provided in Lagria’s video from [02:47].



    1.2. Types of Visualizations

    Data can be visualized in a number of ways. Lagria’s video presented two types of

    visualization: those meant for exploration and calculation, and those meant to

    communicate information.

    A Graph is a medium of visualization designed to communicate information. Depending

    on the type of data, there is almost always a suitable graph to use.

    Categorical Data can be visualized through:

    1. Bar Graph

    2. Pie Chart

    3. Pareto Chart

    4. Side-by-side chart

    Numerical Data can be displayed using:

    1. Stem-and-Leaf Display

    2. Histogram
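    As a rough, text-only sketch of the two graph families above (the data here are invented for illustration), a bar graph reduces to per-category frequency counts, and a histogram reduces to counts over equal-width numeric bins:

```python
from collections import Counter

def ascii_bar(data):
    """Tally categorical data and render one '#' bar per category."""
    counts = Counter(data)
    return ["{:<10} {}".format(cat, "#" * n) for cat, n in sorted(counts.items())]

def histogram_counts(values, bins):
    """Count numeric values into equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

for line in ascii_bar(["A", "B", "A", "C", "A", "B"]):
    print(line)                                # A ###  /  B ##  /  C #
print(histogram_counts([1, 2, 2, 3, 8, 9, 10], 3))  # → [4, 0, 3]
```

    A real report would of course draw these with a charting tool; the point is only that every bar graph or histogram is a frequency table underneath.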

    Charts, on the other hand, typically refer to a visualization medium that shows structure

    and relationship. Some examples are flowcharts and network diagrams. Note that these

    definitions are quite fluid. For example, even though it’s technically a graph, we refer to a

    circle divided into sections showing proportion as a “Pie Chart”.

    Finally, we call schematic pictures or illustrations of objects and entities Diagrams.

    There are many different types of graphs and charts. You can learn more about these in

    Lagria’s video from 06:52, and on the MIT Lecture Notes OER from p.26 to 42.

    1.3. Visual Design Principles

    Lagria’s video (starting from 15:35) described a study by Lohse in 1994. The study

    involved 60 participants and their response to different types of visualizations.

    Some of their findings were:

    1. Simple images are better: icons were preferred over photographs.

    2. Graphs and tables were the most self-similar categories.

    3. Animation is recommended for temporal data.


    One important characteristic of data visualization is that it needs to have preattentive

    properties: it communicates without requiring the viewer to pay close attention. This can

    be likened to glance value. It is determined by factors described in the study, such as eye

    movement measured in milliseconds. Color, shape, and order can also determine how

    preattentive a visualization is.

    Lagria’s video also presented Tufte’s principles of graphical design excellence from

    22:22, introducing concepts such as graphical integrity, data-ink ratio, data density, and

    the lie factor. These also affect how visualizations are perceived.

    Study Questions

    What is the purpose of visualizing data?

    What are some types of graphs? Give examples that use each.


    UNIT IV: ETHICS IN DESCRIPTIVE ANALYTICS

    This unit intends to:

    1. Familiarize students with possible ethical and legal dilemmas in research;

    2. Familiarize students with ethical and legal guidelines that can be applied to

    descriptive analytics; and

    3. Make students realize the implications of ethical and legal descriptive analytics.

    1.1. Ethics in Descriptive Analytics: Dilemmas and Guidelines

    Many of us are familiar with the scientific method and scientific research. While these

    contribute to the body of knowledge and worldwide advancement, these could also

    compromise people and data when conducted without ethics. Research ethics started

    primarily in the area of health research, where human participants served as guinea pigs in

    clinical trials. Over the decades, scholars from different disciplines realized that ethics is

    not only applicable to health research; it is applicable to everything that requires the use

    of data.

    Why is research ethics important? As the group that trains researchers from all disciplines

    in the Philippines, the Philippine Health Research Ethics Board of the Department of

    Science and Technology argued that research ethics should be embedded in all research

    processes primarily because of the following reasons:

    • It is the right thing to do;

    • It protects research participants;

    • It provides advocates for research participants;

    • It preserves credibility, trust, and accountability;

    • It reduces liabilities, wasted time, and resources;

    • It turns research from useless, harmful, and worthless to useful, helpful, and worthy.

    How are these applicable to descriptive analytics? As pointed out by lawyer Emerson

    Banez, descriptive analytics is an activity that’s embedded within society, and such society

    has norms; it is also governed by laws. These could already provide people an idea about

    the right thing to do. Before embarking on a descriptive analytics activity, check the policies

    of the company to be studied. These policies reflect the norms of the company, and these

    have also incorporated the laws that regulate the practices of the industry where the

    company belongs. Ensure that you will not break any of these policies in your conduct of

    descriptive analytics.


    You may also argue based on what you have learned that descriptive analytics and

    business analytics in general do not deal much with human participants who should be

    protected and advocated for. However, it is important to note that data are products of

    human activities. While these are not directly collected from individuals, they still require

    proper handling and judgment. Lawyer Emerson Banez discussed the importance of

    avoiding bias and discrimination when dealing with data. In the previous units, you

    have learned about sampling and various descriptive statistical tests. To be ethical, one

    must not discriminate among the data to be analyzed. Selection must not be biased

    towards data that reflect only the outcomes the researchers desire; the data must

    reflect objectivity in order to guide the company towards better decision-making.

    In addition to avoiding discrimination and bias, lawyer Emerson Banez also pointed out

    the importance of integrity, transparency, and accountability. If the data you used

    have been compromised, full disclosure must be made to ensure that the

    company will be protected against decisions based on bad descriptive analytics.

    Perhaps the most important aspect of ethics in descriptive analytics is privacy. Data must

    not expose the individuals involved in the activities. Data privacy is not only an ethical

    obligation; this is also a law in the Philippines. Also known as the Republic Act 10173, the

    Data Privacy Act recognizes the need for citizens’ data to be protected and secured. These

    meant that consent must be sought from individuals whose activities result in the data that

    will be analyzed in descriptive analytics. One must exercise caution and ensure that no

    privacy is violated by the acquisition, processing, and dissemination of data for descriptive

    analytics. Otherwise, the company that used the data may be fined or shut down due to

    legal issues. By exercising caution, descriptive analytics can guide the company towards

    competitive advantage without liabilities and wasted resources.

    Learning Resources

    1. Watch: Atty. Emerson Banez’s video on ethical issues

    https://www.youtube.com/watch?v=LRn6Nvd6Qqc&index=46&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I

    2. Watch: Dominic Ligot’s Ethical Implications of Business Analytics

    https://www.youtube.com/watch?v=PhDHtc_8nm8&index=64&list=PLiqeNUxu5x2HplGrEaxWMGlb_h9MVEo6I



    References

    Government of the Philippines. (2012). Republic Act 10173 – Data Privacy Act of 2012.

    Retrieved from https://privacy.gov.ph/data-privacy-act/

    Philippine Health Research Ethics Board. (n.d.). An introduction to ethics in research.

    Department of Science and Technology, Taguig City.

    Assignment

    Write a two-page self-reflection on how the course contributed to your understanding of

    descriptive analytics in business.