Fall 2017 CptS 483:04 Introduction to Data Science What Is Data Science? Assefaw Gebremedhin


Fall 2017

CptS 483:04 Introduction to Data Science

What Is Data Science?

Assefaw Gebremedhin


Assefaw Gebremedhin: Introduction to Data Science, http://scads.eecs.wsu.edu

What is Data Science?

• Big Data and Data Science hype
  • and getting past the hype
• Why now?
• Current landscape of perspectives
• Skill sets needed


Big Data and Data Science Hype

What might be eyebrow-raising about Big Data and Data Science?

• Lack of definition around basic terminology
• Lack of recognition for researchers in academia and industry who have been working on this kind of stuff for years
• The hype is crazy
• Statisticians might perceive this whole movement as identity theft
• Some say “anything that has to call itself a science isn’t”

Source: Doing Data Science (O’Neil & Schutt, 2013).


Getting past the hype

Around all the hype, there is a ring of truth: Data Science is something new. It has access to a larger body of knowledge and methodology, as well as a process that has foundations in both statistics and computer science. [DDS, O’Neil and Schutt]

We are here in this course to understand this better and contribute to the ongoing pursuit of a sharper definition.


Quote from Intro of “Foundations of Data Science” manuscript by Avrim Blum, John Hopcroft and Ravindran Kannan (2015)

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

John Hopcroft


Quote from Intro of “Foundations of Data Science” manuscript by Avrim Blum, John Hopcroft and Ravindran Kannan (2015)

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

John Hopcroft


Why Now? Enablers of today’s “big data revolution”

• Proliferation of sensors
• Creation of almost all information in digital form
  • Datafication
• Dramatic cost reduction in storage
  • You can afford to keep all the data
• Dramatic increases in network bandwidth
  • You can move the data to where it is needed
• Dramatic cost reduction and scalability improvements in computation
• Dramatic algorithmic breakthroughs
  • Machine Learning, Data Mining, Fundamental advances in CS and Statistics
• Ever more powerful models producing ever increasing volumes of data that must be analyzed


Current landscape (of perspectives)

Example 1. Metamarkets CEO Mike Driscoll (in a 2010 Quora discussion on “What is Data Science?”):

Data Science, as practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics. But data science is not merely hacking—because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics. And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it. Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.


Current landscape (of perspectives)

Example 2. Drew Conway’s Venn diagram of DS (2010)


Current landscape (of perspectives)

Example 3. Vasant Dhar, in the article “Data Science and Prediction”, Communications of the ACM, Dec 2013, makes the following three big points:
http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext

• Data Science is the study of the generalizable extraction of knowledge from data.
• A common requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
• A data scientist requires an integrated skill set spanning math, ML, statistics, computer science, along with a deep understanding of the craft of problem formulation to engineer effective solutions.
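
The second point, predictive power versus explaining the past, can be made concrete with a toy sketch. This is not from Dhar's article: the data is synthetic, and the helper names (`fit_line`, `fit_memorizer`, `make_data`) are hypothetical. A model that simply memorizes past observations explains them perfectly yet says little about new cases, while a simple fitted line generalizes.

```python
# Toy illustration (synthetic data, hypothetical names): explaining the past vs. predicting.

def mse(model, xs, ys):
    """Mean squared error of a model's predictions on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Lookup table: replays seen answers, falls back to the training mean."""
    table = dict(zip(xs, ys))
    fallback = sum(ys) / len(ys)
    return lambda x: table.get(x, fallback)

def make_data(xs):
    """Synthetic observations: a linear trend plus a small alternating wobble."""
    return [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

# "Past" data is used for fitting; "future" data is only used for evaluation.
past_x, future_x = list(range(10)), list(range(10, 15))
past_y, future_y = make_data(past_x), make_data(future_x)

line = fit_line(past_x, past_y)
memorizer = fit_memorizer(past_x, past_y)

results = {
    "memorizer": (mse(memorizer, past_x, past_y), mse(memorizer, future_x, future_y)),
    "line": (mse(line, past_x, past_y), mse(line, future_x, future_y)),
}
for name, (past_err, future_err) in results.items():
    print(f"{name:10s} past MSE = {past_err:8.3f}   future MSE = {future_err:8.3f}")
```

In this sketch the memorizer has zero error on the past and a large error on the held-out points, whereas the line has a small error on both. Evaluating on data not used for fitting is what separates the two.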


A figure from Dhar’s article: Projected growth rate of unstructured and structured data


A Data Science Profile

• Computer science
• Math
• Statistics
• Machine Learning
• Domain expertise
• Communication and presentation skills
• Data visualization


Author Schutt’s data science profile


So What Is a Data Scientist, Really?

• In Industry:
  • A data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human.
  • She spends a lot of time in the process of collecting and cleaning data. This process requires persistence, statistics, and software engineering skills.
  • Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense.
  • She will find patterns and build models and algorithms, some with the intention of understanding product usage and others to serve as prototypes that ultimately get baked back into the product.
  • She may design experiments, and she is a critical part of data-driven decision making.
  • She will communicate with team members, engineers, and leadership in clear language and with data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications.


So What Is a Data Scientist, Really?

• In Academia:
  • An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.


Speaking of data scientists in academia…

EXPERIMENTAL THEORETICAL COMPUTATIONAL (Simulation)

The 4th PARADIGM (Data) connectedness


Impact


Impact of data science (with emphasis on networked data)

• Economic
  • Web search
  • Social networking
• Health
  • Drug design
  • Metabolic engineering
• Security
  • Fighting terrorism (net-war)
• Epidemics
  • Epidemic prediction (biological, electronic viruses)
• Brain Research
  • NIH has initiated the Connectome project, aimed at developing a neuron-level map of mammalian brains
• Management
  • Uncovering the internal structure of an organization
• Learning
  • …


Economic Impact

Google: market cap (Jan 1, 2010): $189 billion
Cisco Systems (networking gear): market cap (Jan 1, 2010): $112 billion
Facebook: market cap: $50 billion (www.bizjournals.com/austin/news/2010/11/15/facebooks...)

Slide credit: Barabasi (Network Science)