03/22/22 CSE/CBS 572: Data Mining by Huan Liu 1 CSE/CBS 572 Data Mining Huan Liu, CSE, CEAS, ASU http://www.public.asu.edu/~huanliu/DM07S/cse572.html

CSE/CBS 572 Data Mining

Page 1: CSE/CBS 572 Data Mining

CSE/CBS 572 Data Mining

Huan Liu, CSE, CEAS, ASU
http://www.public.asu.edu/~huanliu/DM07S/cse572.html

Page 2: CSE/CBS 572 Data Mining

CSE 572

- Contents of basic and advanced topics
  - Classification, Clustering, Association, and Applications
- Format: an interactive and hands-on course with ample opportunities to work, create, and share
  - Paper reading, discussion, project, presentation, or any learning activities you can suggest
- Assessment in various ways
  - Class participation, assignments, quizzes, a course project, presentations, 1 or 2 exams

Page 3: CSE/CBS 572 Data Mining

- You: our future data mining star, a potential zillionaire
- TA: Lei Tang, [email protected]
- Instructor: Huan Liu, [email protected]
  - Where: Brickyard 566
  - When: see the course website, or by appointment
- “No pain, no gain”, or “As you sow, so you shall reap”. We will also learn the principle of “No Free Lunch”.
- MyASU will be used; make sure you won’t miss important announcements and that your email address stays current (I sometimes push information to you via email).

Page 4: CSE/CBS 572 Data Mining

Course Format

- What is the effective teaching of graduate data mining? Your feedback is keenly sought.
- Current research papers: the main categories can be found on the course website.
- You can choose one of the textbooks listed. It is an entry point for you to access related subjects. The truth is that this is a fast-changing field.
- Everyone is expected to read research papers and participate in class discussion.
- Paper presentations.
- Project presentations.
- Presentations will also be evaluated in class.

Page 5: CSE/CBS 572 Data Mining

Point distribution (tentative)

- Projects (20-25%)
- Reading/presentation assignment (5%)
- Exam(s) (50%)
- Assignments (20-25%), and class participation, quizzes (up to 10% extra credit)
- Late penalty: YES, increasing exponentially.
- Academic integrity (http://www.public.asu.edu/~huanliu/conduct.html)

Page 6: CSE/CBS 572 Data Mining

Research paper reading

- We will provide a reading list, and you can also choose your favorite.
- All are expected to search for and read the selected papers; this is part of the learning process.
  - What is it about (e.g., key idea, basic algorithm)?
  - What are points to discuss and improve?
  - What can we do with it?
- What to submit? (see more on the class website)
  - A brief report that describes the above and 2 questions suitable for quizzes/tests, with solutions
  - A set of presentation slides for 20 minutes
  - Due date: TBA, use digital drop box
- Grading criteria include (1) quality of additional papers you select, (2) slides for presentation, (3) the report, and (4) oral presentation. Oral presentations will be selected among the best submissions, and presenters will be given extra credit based on presentation.
- Presentations can start as early as February, if possible.

Page 7: CSE/CBS 572 Data Mining

Project

- Proposal
  - A theme-based project for all in class? More discussion later
  - Proposal presentation, discussion, revision
  - A project worth the effort of a semester’s work
- Progress report
- Final report
  - Class presentation and/or demo
- One key goal of this course is to take advantage of your intelligence and (limited) experience (so you’re bold and creative) to expand your knowledge in creating something useful and interesting

Page 8: CSE/CBS 572 Data Mining

Topic Distribution (tentative)

Topics                     Classes
Introduction               2
Classification             5
Evaluation                 2
Pre-processing             3
Clustering                 4
Association                4
Novel Applications         4
Project related            4
Real-World Applications    2

Page 9: CSE/CBS 572 Data Mining

Categories of interests (including design and implementation)

1. Data and application security
   - Data mining and privacy
2. Data reduction and selection
   - Streaming data reduction
   - Dealing with large data (column- & row-wise)
   - Search bias, overfitting
3. Learning algorithms
   - Ensemble methods
   - Semi-supervised learning
   - Active learning and co-training
4. Bioinformatics or others

A discussion board will be created, if needed

Page 10: CSE/CBS 572 Data Mining

Your first assignment is to think

1. Think about what you want to accomplish in this course.

2. List 2 of your areas of interest (don’t be restricted by the previous list; this is one of the rare opportunities that allow you to day-dream and earn grade points), and briefly justify your choices.

3. Pick an area of interest and choose a general topic for paper presentation.

4. Complete the above and submit it in the 3rd class (Monday, January 29, 2007). Submission is in hardcopy.

Page 11: CSE/CBS 572 Data Mining

Your 2nd homework (project related) assignment

- Find some (2-3) Web 2.0 or HealthCare applications
  - Wikipedia, some examples: facebook, del.icio.us, myspace, digg, …
  - Massive data of various types in a hospital environment
- Surf the Web and compile their URLs and functions
- Choose your favorite one and explain why
  - How can data mining help, from your perspective, with your limited knowledge of data mining?
- Test-drive it (manually, imagine that you have what you wish for, …)
- Write a summary about your experience, challenges you encountered, and suggestions if any
- Due on Wednesday, February 7, 2007

Page 12: CSE/CBS 572 Data Mining

Selecting a paper of your interest

- First, continue from your 1st (and/or 2nd) homework assignments about your category of interest
- Second, you may form presentation groups (e.g., 2-3 a group for discussion purposes)
- Third, each student picks a paper from the given category and finds 1 additional high-quality relevant paper
  - Submit it through myASU
  - The first student in each category will present the given paper (s/he does not need to look for another paper)
  - TA will organize and help you and compile a list of all papers at the end
- Write a summary for your selected paper including
  - What is it about
  - Why is it significant and relevant
  - Where is it published and when

Page 13: CSE/CBS 572 Data Mining

Introduction

- The need for data mining in the Internet era
- Data everywhere, examples?
  - Data mining
  - Text mining
  - Image mining
  - Web mining (log, link, content, blog)
  - Bioinformatics
- Many products and abundant applications
- Where do we stand

Page 14: CSE/CBS 572 Data Mining

What is data mining

Data mining is
- the extraction of useful patterns from data sources, e.g., databases, texts, web, images.
- the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Page 15: CSE/CBS 572 Data Mining

Patterns (1)

- Patterns are the relationships and summaries derived through a data mining exercise.
- Patterns must be:
  - valid
  - novel
  - potentially useful, or actionable
  - understandable

Page 16: CSE/CBS 572 Data Mining

Patterns (2)

Patterns are used for
- prediction or classification
- describing the existing data
- segmenting the data (e.g., the market)
- profiling the data (e.g., your customers)
- detection (e.g., intrusion, fault, anomaly)

Page 17: CSE/CBS 572 Data Mining

Data (1)

- Data mining typically deals with data that have already been collected for some purpose other than data mining.
- Data miners usually have no influence on data collection strategies.
- Large bodies of data cause new problems: representation, storage, retrieval, analysis, ...

Page 18: CSE/CBS 572 Data Mining

Data (2)

- Even with a very large data set, we are usually faced with just a sample from the population.
- Data exist in many types (continuous, nominal) and forms (credit card usage records, supermarket transactions, government statistics, text, images, medical records, human genome databases, molecular databases).

Page 19: CSE/CBS 572 Data Mining

Typical DM tasks

- Classification: mining patterns that can classify future data into known classes.
- Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
- Clustering: identifying a set of similar groups in the data
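As a minimal sketch of how an association rule X → Y over sets of data items can be scored, the snippet below computes support and confidence, the two measures commonly used for such rules. The slide does not name these measures, and the `baskets` data and function names are hypothetical, chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List


def count_containing(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[FrozenSet[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """Fraction of all transactions that contain both X and Y."""
    both = frozenset(x) | frozenset(y)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions: List[FrozenSet[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """Among the transactions containing X, the fraction that also contain Y."""
    x_count = count_containing(transactions, x)
    if x_count == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    both = frozenset(x) | frozenset(y)
    return count_containing(transactions, both) / x_count


# Hypothetical supermarket transactions (made up for this example).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]

if __name__ == "__main__":
    print(support(baskets, {"bread"}, {"milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

Here the rule {bread} → {milk} holds in 3 of the 5 baskets, and in 3 of the 4 baskets that contain bread. A mining algorithm would search for all rules whose support and confidence exceed user-chosen thresholds; this sketch only scores one given rule.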

Page 20: CSE/CBS 572 Data Mining

- Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence
- Deviation/anomaly/exception detection: discovering the most significant changes in data
- Data visualization (or visual analytics): using graphical methods to show patterns in data.
- High performance computing
- Bioinformatics
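The confidence of a sequential rule A → B can be estimated from an event log as the share of occurrences of A that are immediately followed by B. The sketch below assumes a single ordered list of events; the function name and the `clickstream` data are hypothetical.

```python
from typing import List


def sequential_confidence(events: List[str], a: str, b: str) -> float:
    """Share of occurrences of `a` that are immediately followed by `b`."""
    total_a = events.count(a)
    if total_a == 0:
        return 0.0  # `a` never occurs, so there is no evidence for the rule
    # Look at each adjacent pair (current event, next event).
    followed = sum(
        1 for current, nxt in zip(events, events[1:])
        if current == a and nxt == b
    )
    return followed / total_a


# Hypothetical clickstream (made up for this example).
clickstream = [
    "login", "search", "buy",
    "login", "search", "logout",
    "login", "buy",
]

if __name__ == "__main__":
    print(sequential_confidence(clickstream, "login", "search"))
    print(sequential_confidence(clickstream, "search", "buy"))
```

In this log, "login" occurs three times and is immediately followed by "search" twice, so the rule login → search is estimated at 2/3. An occurrence of A at the very end of the log counts against the rule, which is one possible convention among several.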

Page 21: CSE/CBS 572 Data Mining

Why data mining

- Rapid computerization of businesses produces huge amounts of data
- How to make the best use of data to our advantage?
- A growing realization: knowledge discovered from data can be used for competitive advantage and to increase business intelligence.
- There are problems that might not be suitable for data mining: Top 10 Statistics Problems for CapitalOne (Bill Khan’s invited talk at SIGKDD’06)

Page 22: CSE/CBS 572 Data Mining

Make use/sense of your data assets

- Many interesting things you want to find cannot be found using database queries
  - “find me people likely to buy my products”
  - “Who are likely to respond to my promotion”
- Quickly identify underlying relationships and respond to emerging opportunities
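To illustrate why a question such as "find me people likely to buy my products" goes beyond a database query: a query can only return customers who already bought, whereas a prediction for a new customer has to generalize from past records. The sketch below uses a k-nearest-neighbor vote, which is one possible technique chosen here for brevity and not something the slide prescribes. The customer records, the feature choice (age, visits per month), and the function name are all hypothetical.

```python
import math
from typing import List, Tuple

Record = Tuple[Tuple[float, float], bool]  # ((age, visits_per_month), bought)

# Hypothetical past customers (made up for this example).
past_customers: List[Record] = [
    ((25, 1), False),
    ((32, 8), True),
    ((47, 2), False),
    ((51, 9), True),
    ((38, 7), True),
    ((29, 0), False),
]


def predict_buyer(features: Tuple[float, float], history: List[Record], k: int = 3) -> bool:
    """Majority vote among the k past customers closest to `features`.

    Distances are plain Euclidean on unscaled features, which is a
    simplification; real data would usually be normalized first.
    """
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], features))[:k]
    votes = sum(1 for _, bought in nearest if bought)
    return votes * 2 > k


if __name__ == "__main__":
    # A database query: who bought in the past (known facts only).
    print([feats for feats, bought in past_customers if bought])
    # A prediction: is a new, unseen customer likely to buy?
    print(predict_buyer((35, 6), past_customers))
```

The list comprehension plays the role of the query and can only report recorded purchases; `predict_buyer` produces an answer for a customer who appears nowhere in the data.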

Page 23: CSE/CBS 572 Data Mining

Why now and for the near future

- The data is abundant.
- The data is being collected or warehoused.
- The computing power is affordable.
- The competitive pressure is increasing.
- Data mining tools have become available.
- New challenges
  - New data types evolve
  - New applications emerge

Page 24: CSE/CBS 572 Data Mining

DM fields

- Data mining is an emerging multi-disciplinary field:
  - Statistics
  - Machine learning
  - Databases
  - Visualization
  - OLAP and data warehousing
  - High-performance computing
  - ...

Page 25: CSE/CBS 572 Data Mining

Summary

- What is data mining?
  - KDD (knowledge discovery in databases): non-trivial extraction of implicit, previously unknown and potentially useful/actionable information
- Why do we need data mining?
  - Wide use of computer systems - data explosion - knowledge is power - but we’re data rich, knowledge poor - useful, understandable and actionable knowledge ...
- Data mining is not plug-and-play, so we are not done yet and need to continue this class …

Page 26: CSE/CBS 572 Data Mining

An Overview of KDD Process (Guess which is which)

Page 27: CSE/CBS 572 Data Mining

Web mining – an application

- The Web is a massive database
- Semi-structured data
- XML and RDF
- Web mining
  - Content
  - Structure
  - Usage
- Link analysis

Page 28: CSE/CBS 572 Data Mining

About project

- Data collection
  - How, where, what to collect
- Evaluation methods
  - How and why
- Analysis and mining
  - What requires our (yet-to-develop) expertise
  - For what
- Potentially useful methods
  - For example, graph theories, social networks
- Great applications