A Practical Approach To Data Mining Presentation

Presented at Project World and World Congress for Business Analysts in Anaheim, CA, November 2009

A Practical Approach to Data Mining While Maintaining System Performance, Security and Privacy

Chuck Miller - PMP, SSBB

Project Manager

Prescription Solutions

2

Agenda

Introduction

What is Data Mining?

Data Mining Tools

Common Uses

Meaningful Data

Roadblocks

System Performance

Stability

Security

Privacy and Ethics

Knowledge Exercise

Resources

3

Introduction

In today’s world, security and privacy have become major concerns, especially where data is involved. Those concerns, along with system performance, can greatly affect your ability to gather meaningful data.

Whether you are an analyst just starting out or a seasoned veteran looking for a refresher, the following information will help you gain a clearer understanding of what data mining can do and give you the keys to unlocking your analytical potential.

4

What Data Mining is…

In its simplest form, data mining is the process of extracting hidden patterns from data.

Data mining is considered proactive because it lets you use historical information to predict future trends.

5

What Data Mining is… (cont.)

Data mining commonly involves four classes of tasks or techniques:

Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include nearest neighbor, the naive Bayes classifier, and neural networks.

Clustering - Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
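As a rough illustration of classification into predefined groups, here is a minimal nearest-neighbor sketch for the spam example. The two features (number of links, number of suspicious words) and the training values are invented for illustration; a real mail filter would use far richer data.

```python
import math

# Hypothetical training data: (links, suspicious words) -> predefined label.
TRAINING = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((1, 1), "legitimate"),
    ((6, 4), "spam"),
    ((8, 5), "spam"),
    ((7, 7), "spam"),
]


def classify(features, training=TRAINING):
    """Return the label of the training example closest to `features`."""
    nearest = min(training, key=lambda item: math.dist(item[0], features))
    return nearest[1]


print(classify((0, 1)))  # close to the "legitimate" examples
print(classify((9, 6)))  # close to the "spam" examples
```

Clustering would start from the same kind of points but without the labels, leaving the algorithm to discover the groups itself.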

6

What Data Mining is… (cont.)

Regression - Attempts to find a function that models the data with the least error.
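One common way to find such a function is an ordinary least-squares fit of a straight line. The sketch below assumes a single input variable and uses made-up monthly sales figures; it is one possible instance of regression, not the only one.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: month number -> sales (in thousands).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
print(slope, intercept, forecast_month_6)
```

The fitted line is then used to predict a future value, which is the "historical information to predict future trends" idea from the earlier slide.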

Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can determine which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis". It is very commonly used online today to track your purchases and suggest other items you may be interested in. Amazon.com is a prime example.
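The first step of market basket analysis can be sketched as counting how often each pair of products appears in the same basket. The baskets below are invented, and the sketch only computes pair support (the share of baskets containing both items); full association rule learning adds further measures on top of this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase data: one set of products per customer visit.
BASKETS = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"bread", "milk"},
    {"shampoo", "conditioner", "milk"},
    {"bread", "soap"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each product pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(BASKETS)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```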

7

What Data Mining is Not…

The following terms are often referred to as data mining but are actually data mining tools.

Data Warehousing

SQL/Ad Hoc Queries

Reporting

Data Visualization/Dashboards

Online Analytical Processing (OLAP)

8

Data Mining Tools

Most data mining tools can be classified into one of three categories:

Traditional data mining tools

Dashboards

Text-mining tools

9

Data Mining Tools - Traditional

Traditional data mining programs help you establish data patterns and trends by using a number of complex algorithms and techniques, as outlined in the previous slides. These tools come in many forms and produce many kinds of output, and they are generally available for any operating system. In addition, while some concentrate on one database type, most can handle any data using OLAP or a similar technology.

10

Data Mining Tools - Dashboards

Dashboards are installed on computers to monitor information in a database and reflect data changes and updates onscreen. They are very popular because the graphical representation makes trends easy to spot.

11

Data Mining Tools - Text

The third type of data mining tool is sometimes called a text-mining tool because of its ability to mine data from different kinds of text. Microsoft Word documents, Acrobat PDF files and simple text files are just a few of the types. These tools scan content and convert the selected data into a format that is compatible with the tool's database, giving users an easy and convenient way of accessing data without the need to open different applications.

These tools are useful because scanned content can be unstructured (i.e., information is scattered almost randomly across the document, including e-mails, Internet pages, audio and video data) or structured (i.e., the data's form and purpose are known, such as content found in a database).

12

Common Uses of Data Mining…

Market Basket Analysis - Customers are very likely to purchase shampoo and conditioner together, so a retailer would not put both items on promotion at the same time. The promotion of one would likely drive sales of the other.

Direct mail marketing - Based on data mining, a company determines who is likely to be interested in a particular product or promotion. It then uses that information to send mail or email to that targeted audience, which gives a much higher return on investment than an untargeted mailing.

13

Common Uses of Data Mining… (cont.)

Credit card fraud detection - Have you ever received a phone call from your credit card company after making a purchase, asking whether it was you who made it? This happens because your purchasing trends were modeled using data mining, and you bought something that fell outside your model.
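A very simplified version of "outside your model" can be sketched as flagging a purchase whose amount is far from a cardholder's historical average. The three-standard-deviation threshold and the purchase history below are assumptions chosen for illustration; real fraud models use many more signals than the amount alone.

```python
from statistics import mean, stdev


def is_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard
    deviations away from the mean of past purchase amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical; anything else is unusual.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical purchase history for one cardholder.
past_purchases = [22.5, 30.0, 18.75, 25.0, 27.5, 21.0]

print(is_unusual(past_purchases, 26.0))   # typical purchase
print(is_unusual(past_purchases, 950.0))  # far outside the usual range
```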

Bioinformatics - Mapping the human genome and creating modeling sequences.

14

Example - Manufacturing

15

Example - Bioinformatics

16

Example – Customer Service

17

What is Meaningful Data?

Meaningful or useful data is data that you are relatively certain contains the information you are mining for. Mined data must still be interpreted for relevance.

To ensure the data is meaningful, you need to create validation rules. Data validation can range from the simple (verifying that characters come from a valid set) to the complex (automated programs that check data against detailed, specific criteria).
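Both ends of that range can be sketched in a few lines. The record layout, the customer ID format, and the specific rules below are hypothetical and stand in for whatever criteria apply to your own data.

```python
import re
from datetime import date

# Simple rule: the ID may only use characters from a valid set.
# The format (two capital letters, four digits) is an assumption.
VALID_ID = re.compile(r"[A-Z]{2}[0-9]{4}")


def validate(record, as_of):
    """Return a list of problems found in one record (empty if valid)."""
    problems = []
    if not VALID_ID.fullmatch(record["customer_id"]):
        problems.append("customer_id has invalid characters or length")
    # More specific criteria: values must make business sense.
    if record["quantity"] <= 0:
        problems.append("quantity must be positive")
    if record["order_date"] > as_of:
        problems.append("order_date is in the future")
    return problems


good = {"customer_id": "CA1234", "quantity": 2, "order_date": date(2009, 10, 1)}
bad = {"customer_id": "ca-12", "quantity": 0, "order_date": date(2010, 1, 1)}

print(validate(good, as_of=date(2009, 11, 1)))
print(validate(bad, as_of=date(2009, 11, 1)))
```

Records that fail validation can be set aside for review before mining rather than silently dropped.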

18

What is Meaningful Data? (cont.)

Two terms are commonly used to describe types of data: positive and negative.

Positive Data - The most common type. As discussed previously, it is used for forecasting or predicting future results and behavior.

19

What is Meaningful Data? (cont.)

Negative Data - Anomalies or discrete events that can skew your results.

For example, a one-time promotion of a product occurred and will never happen again. Including this item in your model will throw it off because, if not for that promotion, your customer would never have purchased it.

20

How to Navigate Around Data Access Roadblocks

High-level buy-in

It is always helpful to have executive support when access to data is required.

Look to tie your need in with a corporate initiative.

21

How to Navigate Around Data Access Roadblocks (cont.)

Explain what data mining is.

Cite some examples from companies similar to yours and the results they have produced with data mining.

For example: ABC Co. has increased sales in this segment quarter over quarter by implementing sales suggestions on its website.

22

How to Navigate Around Data Access Roadblocks (cont.)

Take a small data set and prove what the benefits are.

This is usually the most difficult because you need access to enough meaningful data to create a model.

23

How to Navigate Around Data Access Roadblocks (cont.)

Suggest limited, timed access to the data.

It is always better to have the most current data, but if you are mining against monthly results you may only need access once the monthly cycle is complete. You then have a window in which to run the data mining, after which the data is in your chosen tool and available for use.

24

Data Mining vs. System Performance

Never conduct data mining during peak operating hours.

Conduct a sample run on a smaller subset of data to check run time and performance degradation.

Conduct data mining on a backup copy or read-only version of the databases.
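The sample-run advice can be sketched as timing the job on a small random subset and scaling the result up. The `mine` function below is a stand-in for a real mining job, and the estimate assumes run time grows roughly in proportion to the number of rows, which will not hold for every algorithm.

```python
import random
import time


def mine(rows):
    """Stand-in for a real mining job: totals the amount per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def estimate_full_runtime(rows, sample_fraction=0.05, seed=0):
    """Time `mine` on a random sample and extrapolate to all rows.

    Assumes roughly linear scaling, so treat the result as a rough guide.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)
    start = time.perf_counter()
    mine(sample)
    elapsed = time.perf_counter() - start
    return elapsed * len(rows) / sample_size


# Hypothetical data set: 10,000 (customer, amount) rows.
rows = [(i % 100, 1.0) for i in range(10_000)]
estimated_seconds = estimate_full_runtime(rows)
print(f"Estimated full run: {estimated_seconds:.4f} seconds")
```

If the estimate is longer than the off-peak window you have available, that is a signal to reduce the data set or schedule the run differently before touching production.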

25

How to Maintain Data Stability

Limit access to data.

Always back up your data.

It is preferable to use a backup or read-only copy of data.

26

Security - Internal

Do the people running the data mining have clearance to access the systems?

Do the people reviewing that data have clearance or the need to know that information?

When mining multiple databases, does the assembled information violate any security policies?

27

Security - External

Limit access to data.

Eliminate external references.

Summarize the data.

Data masking.
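As one possible illustration of data masking, the sketch below replaces a customer name with a keyed hash and hides all but the last four digits of a card number before the record leaves the organization. The field names and the secret key are hypothetical; in practice the key would come from a managed secret store, and the masking rules would follow your own security policy.

```python
import hashlib
import hmac

# Hypothetical key; never hard-code a real one.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a stable keyed hash (same input, same output)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_card(number):
    """Keep only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def mask_record(record):
    """Return a copy of the record that is safer to share externally."""
    return {
        "customer": pseudonymize(record["customer"]),
        "card": mask_card(record["card"]),
        "amount": record["amount"],
    }


masked = mask_record(
    {"customer": "Jane Doe", "card": "4111 1111 1111 1111", "amount": 42.5}
)
print(masked)
```

Because the keyed hash is stable, the same customer still maps to the same token, so purchase patterns can be mined without exposing the name.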

28

Privacy Concerns and Ethics

Some people believe that data mining alone is ethically neutral. However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics. This is particularly true of mining government or commercial data sets for national security or law enforcement purposes, but it also applies to every company, no matter how large or small, that collects customer information.

29

Privacy Concerns and Ethics (cont.)

Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation.

Data aggregation is when data is gathered, possibly from various sources, and put together so that it can be analyzed. This is not data mining per se, but a result of preparing the data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the data, once compiled, allows the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data was originally anonymous.
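One way to check an aggregated data set for this risk is to count how many records share each combination of seemingly harmless fields; a combination that appears only once may point to a single person. This is a k-anonymity style check, a technique not named in the slides, and the fields and records below are invented for illustration.

```python
from collections import Counter


def risky_groups(records, fields, k=2):
    """Return field-value combinations shared by fewer than `k` records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical aggregated data with names already removed.
records = [
    {"zip": "92801", "birth_year": 1970, "gender": "F", "purchase": "shampoo"},
    {"zip": "92801", "birth_year": 1970, "gender": "F", "purchase": "soap"},
    {"zip": "92801", "birth_year": 1985, "gender": "M", "purchase": "milk"},
    {"zip": "92614", "birth_year": 1970, "gender": "F", "purchase": "bread"},
]

flagged = risky_groups(records, ("zip", "birth_year", "gender"))
print(flagged)
```

Flagged combinations can then be summarized further (for example, by using a birth decade instead of a birth year) before the data set is shared.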

30

Privacy Concerns and Ethics (cont.)

Data aggregation example: Amazon displays items frequently purchased together on its website. This information alone is okay. However, to get this information you have to scan all customer records from the multiple vendors that use the website to sell products and combine them into a working model. If the compiled information includes the individual customers who made the purchases, there is then a privacy issue.

31

Privacy Concerns and Ethics (cont.)

It is recommended that an individual be made aware of the following before data is collected:

The purpose of the data collection and any data mining projects.

How the data will be used.

Who will be able to mine the data and use it.

The security surrounding access to the data.

How collected data can be updated.

One may additionally modify the data so that it is anonymous, so that individuals may not be readily identified.

32

Privacy Concerns and Ethics (cont.)

Does this violate privacy, ethics, both or neither?

33

Suggested Reading

Competing on Analytics by Tom Davenport and Jeanne Harris (Hardcover / 2007)

This book focuses on the challenges of getting an organization to change its approach to problem solving by increasing the use of analytics across a business.

Data Mining Explained by Delmater and Hancock (Paperback / 2001)

Many of the data mining books focus on the technology rather than the impact on the business process. This book provides a good introduction to data mining as well as a good discussion of the business impact that can be felt throughout an organization.

34

Resources

Web: www.thearling.com www.kdnuggets.com

Print: Basic Statistics – Tools for Continuous Improvement by Mark J. Kiemele (Hardcover / 1997)

35

Questions

36

Thank You

If you have any follow-up questions, I can be contacted at miller_ch@prescriptionsolutions.com
