Lecture 1: Data Science & Data Engineering CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier


Homework 1

• Data Structures and Basic Programming

• Due: September 1st at the beginning of class

This assignment must be completed individually. It is worth 25% of your homework grade for the semester, and should help you judge if you are prepared for the course.


Homework 2

• Presentations on Biomedical Data Science

• Due: September 10th during class

You will be divided into five groups, one at NG Xetron, four at UC.

Each group will read an assigned article and prepare a 10 minute presentation on the topic for the rest of the class.


The Big News about Sanders


Data Science and Engineering


From the Information Age to the Data Age


What is Data Science?


What is Data Engineering?


Drew Conway’s Venn Diagram of Data Science


The Foundations of Data Science

• Statistics

• Computer Science

• Domain Expertise


Doing Data Science


Back to Bernie and Clinton…


Problems with Anecdotal Data

• Small number of observations

• Selection bias

• Confirmation bias

• Inaccuracy

Some Basic Definitions

• Population – the set of objects or units to be measured.

• Observations – extracted or measured characteristics about the objects.

• Sample – the subset of objects examined in order to draw conclusions and make inferences about the population.


Example

• Let’s say we want to infer information about the quality of students admitted to UC.

• Define the population, a single observation, and a sample.


• How might we introduce biases into the data?

• What might the consequences be?


Estimating the e-mail generated by employees

• Bearcats Health Insurance Inc. has hired you to help them understand their e-mail traffic. They have 5,000 employees, and it is infeasible to capture all mailing records. They have asked you to evaluate two possible methods for sampling:

– Select 10% of their employees at random, and sample all e-mail they have ever sent.

– Select 10% of all e-mail sent during the day at random.
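The two sampling schemes can be compared on simulated traffic. The sketch below is only an illustration: the per-employee message counts are made up (a capped Pareto draw), and the function names `simulate_mail_counts`, `sample_employees`, and `sample_messages` are hypothetical, not part of the assignment.

```python
import random

def simulate_mail_counts(n_employees, rng):
    """Hypothetical daily message counts per employee (made-up, heavy-tailed)."""
    return [min(int(rng.paretovariate(1.5)), 500) for _ in range(n_employees)]

def sample_employees(counts, percent, rng):
    """Method 1: choose a percentage of employees, keep all of their mail."""
    k = len(counts) * percent // 100
    chosen = rng.sample(range(len(counts)), k)
    return chosen, sum(counts[i] for i in chosen)

def sample_messages(counts, percent, rng):
    """Method 2: choose a percentage of all messages uniformly at random."""
    # One entry per message, labelled with the id of the employee who sent it.
    messages = [emp for emp, c in enumerate(counts) for _ in range(c)]
    k = len(messages) * percent // 100
    return rng.sample(messages, k)

rng = random.Random(0)
counts = simulate_mail_counts(5000, rng)
chosen, kept_messages = sample_employees(counts, 10, rng)
sampled = sample_messages(counts, 10, rng)

print("employees sampled:", len(chosen), "messages kept:", kept_messages)
print("messages sampled:", len(sampled), "distinct senders:", len(set(sampled)))
```

Under these assumptions, the first method captures complete records for few people, while the second touches many senders but favors those who send the most mail.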


But this is the age of BIG DATA!

• Why not just sample every message?


Measurements

• Measurements have inherent assumptions

• Measurements are often stated very informally

– Formalize our measures!


Measurements

Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons. - Maya Gupta, University of Washington


The Problem of Measures

• Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points.

• Let’s take two bodies on the real number line

– Body A is the line A = [0, 1]

– Body B is the line B = [0, 2]

Which is “longer”?


The Problem of Measures

• Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points.

• Let’s take two bodies on the natural number line

– Body A is the line A = [0, 1]

– Body B is the line B = [0, 2]

Which is “longer”?
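One way to make the two questions concrete is to compare two different measures of the same intervals. The sketch below is an illustration: `length` is ordinary length on the real line, and `count_naturals` counts the natural numbers in the interval, assuming the naturals start at 0. Both function names are hypothetical. On the real line a single point has length zero, so length cannot be recovered by summing over points; on the naturals, counting points works.

```python
import math

def length(a, b):
    """Ordinary length (Lebesgue measure) of the interval [a, b] on the real line."""
    return float(max(0, b - a))

def count_naturals(a, b):
    """Counting measure of [a, b] restricted to the naturals 0, 1, 2, ... (assumed to include 0)."""
    lo = max(0, math.ceil(a))
    hi = math.floor(b)
    return max(0, hi - lo + 1)

A = (0, 1)
B = (0, 2)

print("length:", length(*A), length(*B))
print("points:", count_naturals(*A), count_naturals(*B))
```

Both measures agree that B is "longer" than A, but they assign different values, which is why the measure has to be stated explicitly.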


Solving the Problem of Measures

• What does it mean for some body (or subset) to be measurable?

• If a set E is measurable, how does one define its measure?

• What properties or axioms does measure (or the concept of measurability) obey?


Measure Theory

• Before we can measure anything we need something to measure!

• Let’s define a measurable space

– A measurable space is a collection of events B, and the set of all outcomes, Ω, also called the sample space.


Events and Sample Spaces

• Each event, F, is a set containing zero or more outcomes.

– Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move:

– All events in B that contain the selected outcome are said to “have occurred”.


Events and Sample Space

• Take a deck of 52 cards + 2 jokers

• Draw a single card from the deck.

• Sample space: a 54-element set; each card is a possible outcome.

• An event is any subset of the sample space, including a singleton set, or the empty set.


Events and Sample Space

• Potential events:

– “Red and black at the same time without being a joker” – (0 elements)

– “The 5 of hearts” – (1 element)

– “A king” – (4 elements)

– “A face card” – (12 elements)

– “A card” – (54 elements)
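The event sizes listed above can be checked by building the sample space explicitly. This is a minimal sketch: the card encoding as `(rank, suit)` tuples and the helper `event` are assumptions made for the illustration.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 ordinary cards plus 2 jokers (the encoding is an assumption).
deck = frozenset(product(RANKS, SUITS)) | {("joker", 1), ("joker", 2)}

def event(predicate):
    """An event is the subset of outcomes for which the predicate holds."""
    return frozenset(card for card in deck if predicate(card))

def is_joker(card):
    return card[0] == "joker"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_black(card):
    return card[1] in ("clubs", "spades")

events = {
    "red and black, not a joker": event(
        lambda c: not is_joker(c) and is_red(c) and is_black(c)
    ),
    "the 5 of hearts": event(lambda c: c == ("5", "hearts")),
    "a king": event(lambda c: c[0] == "K"),
    "a face card": event(lambda c: c[0] in ("J", "Q", "K")),
    "a card": deck,
}

for name, subset in events.items():
    print(name, len(subset))
```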


Forming an Algebra on B and Ω

• In order to define measures on B, we need to make sure it has certain properties, those of a σ-algebra.

• A σ-algebra is a special kind of collection of subsets that is closed under countable set operations (complement, union of countably many sets, and intersection of countably many sets).

• “Vanilla” algebras are closed only under finite set operations.


Countable Sets

• Countable sets are those that are finite or have the same cardinality as the natural numbers.

• Quick refresher: Prove that the integers and the natural numbers have the same cardinality.
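One standard argument for the refresher is to exhibit an explicit bijection that interleaves the non-negative and negative integers. The sketch below assumes the naturals start at 0; the function names are hypothetical, and checking a finite range only illustrates the mapping, it does not replace the proof.

```python
def nat_to_int(n):
    """Map 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ..."""
    return n // 2 if n % 2 == 0 else -(n + 1) // 2

def int_to_nat(z):
    """Inverse map: non-negative integers go to even naturals, negatives to odd."""
    return 2 * z if z >= 0 else -2 * z - 1

first = [nat_to_int(n) for n in range(11)]
print(first)

# Round trip on a finite range: each natural comes back unchanged.
round_trip_ok = all(int_to_nat(nat_to_int(n)) == n for n in range(1000))
print(round_trip_ok)
```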


σ-algebra

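• If we have a σ-algebra B on our sample space Ω, then:

– Ω ∈ B

– If A ∈ B, then its complement Ω \ A ∈ B

– If A₁, A₂, … ∈ B, then their union A₁ ∪ A₂ ∪ … ∈ B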
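On a finite sample space the closure requirements of a σ-algebra can be checked mechanically, because only finitely many distinct unions exist. The sketch below is an illustration under that finiteness assumption; `is_sigma_algebra` is a hypothetical helper, not a library function.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the σ-algebra closure properties for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:
        return False
    # Closed under complement.
    if any(omega - s not in sets for s in sets):
        return False
    # Closed under pairwise union, which suffices when the collection is finite.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
bad = [set(), {1}, {1, 2, 3, 4}]  # the complement {2, 3, 4} is missing

print(is_sigma_algebra(omega, good))
print(is_sigma_algebra(omega, bad))
```

Closure under intersection then follows from complements and unions by De Morgan's laws, so it is not checked separately.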


Measures

• A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (or ∞).

Formally: µ : B → [0, ∞]


Example Measure

• Let’s define a measure of “Volume”.

• The triple (Ω, B, µ) combines a measurable space and a measure; the triple is called a measure space. This space is defined by two properties:

– Nonnegativity: µ(A) ≥ 0 for all A ∈ B

– Countable additivity: if Aᵢ ∈ B are disjoint sets for i = 1, 2, …, then the measure of the union of the Aᵢ is equal to the sum of the measures of the Aᵢ


Example Measure

• Does the ordinary concept of volume satisfy these two properties?

– Nonnegativity: µ(A) ≥ 0 for all A ∈ B

– Countable additivity: if Aᵢ ∈ B are disjoint sets for i = 1, 2, …, then the measure of the union of the Aᵢ is equal to the sum of the measures of the Aᵢ
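A discretized version of volume gives a quick sanity check of both properties. In the sketch below a body is modeled, as an assumption, by a set of unit cubes on an integer grid, and its volume is the number of cubes. The helpers `volume` and `box` are hypothetical, and the check covers only finitely many bodies, so it illustrates additivity rather than proving countable additivity.

```python
def volume(body):
    """Volume of a body given as a set of unit cubes, each named by its integer corner."""
    return len(body)

def box(x0, x1, y0, y1, z0, z1):
    """All unit cubes with corners in [x0, x1) x [y0, y1) x [z0, z1)."""
    return {
        (x, y, z)
        for x in range(x0, x1)
        for y in range(y0, y1)
        for z in range(z0, z1)
    }

a = box(0, 2, 0, 2, 0, 2)  # 2 x 2 x 2 block
b = box(2, 3, 0, 2, 0, 2)  # 1 x 2 x 2 block, disjoint from a
c = box(1, 3, 0, 1, 0, 1)  # 2 x 1 x 1 block, overlaps a and b

print(volume(a), volume(b), volume(a | b))  # disjoint: volumes add
print(volume(a), volume(c), volume(a | c))  # overlapping: union is smaller than the sum
```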


Two Special Kinds of Measures

• Signed measure – can be negative

• Probability measure – the measure that defines a probability space.

– A probability measure, P, has the normal properties of a measure, but it is also normalized such that: P(Ω) = 1
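A small finite example shows the normalization in action. The sketch below assumes a fair six-sided die as the sample space, which is a hypothetical choice not taken from the slides, and uses exact fractions so the checks are not affected by rounding.

```python
from fractions import Fraction

# Hypothetical sample space: the faces of a fair six-sided die.
omega = frozenset(range(1, 7))

def P(event):
    """Uniform probability measure on omega: |event| / |omega|."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(omega))

evens = {2, 4, 6}
odds = {1, 3, 5}

print(P(omega))                           # normalization
print(P(set()))                           # the empty event
print(P(evens), P(odds), P(evens | odds)) # additivity on disjoint events
```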


Sets of Measure Zero

• A set of measure zero is some set A ∈ B with µ(A) = 0

• For a probability measure, any set of measure zero has probability zero, so it almost surely does not occur.

– It can thus be ignored when stating things about the collection of sets B.


Borel Sets

• A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra.

– Almost any set you can describe on the real line is a Borel set, for example the unit line segment [0, 1], the irrational numbers, etc.

– The Borel σ-algebra on the real line is a collection of sets that is the smallest σ-algebra that includes the open subsets of the real line.


Borel Sets

• For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X.

• Important!

• Why? Any measure defined on the open sets of a space, or the closed sets of a space, must also be defined on all Borel sets of that space.


Borel Sets

• Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets.

• Allows us to define equivalence of measures.


Borel Sets

• Let’s say we have two probability measures.

• To show they are equal we just need to show that:

– They agree on all intervals

• Because the intervals generate the Borel σ-algebra, they then agree on all Borel sets, and hence over the measurable space.

• Example: Given probability distributions A and B with equal cumulative distribution functions, the probability distributions must also be equal.
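The link between a cumulative distribution function and interval probabilities is the identity P(a < X ≤ b) = F(b) − F(a). The sketch below illustrates it with an exponential distribution, which is an arbitrary choice for the example; the function names are hypothetical.

```python
import math

def exponential_cdf(x, rate=1.0):
    """CDF of an exponential distribution: F(x) = 1 - exp(-rate * x) for x > 0, else 0."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def interval_probability(cdf, a, b):
    """P(a < X <= b) computed from the CDF alone."""
    return cdf(b) - cdf(a)

p_01 = interval_probability(exponential_cdf, 0.0, 1.0)
p_12 = interval_probability(exponential_cdf, 1.0, 2.0)
p_02 = interval_probability(exponential_cdf, 0.0, 2.0)

# Adjacent intervals are disjoint, so their probabilities add.
print(p_01, p_12, p_02)
print(math.isclose(p_01 + p_12, p_02))
```

Since every interval probability is determined by F, two distributions with the same F assign the same probability to every interval.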


Measure Theory and Data Science

• Data Science is about working with, and deriving observations or features from data.

• Features are effectively measures of some sort, but often not for the underlying space of interest.

• Important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.


Example

Bearcats Elementary School had 300 students in their 5th grade class. 77% of them graduated to middle school, 12% failed their mathematics Standards of Learning, and 11% failed their reading Standards of Learning. The new class of 1st graders had interventions in mathematics and grammar. Their graduation rates improved to 88%, with 7% failing mathematics and 5% failing reading.

What can we infer? How does measure theory relate?
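As a starting point for the discussion, the percentages can be turned into head counts where the class size is known. The sketch below is only arithmetic on the stated figures: the size of the new class is not given, so no counts are computed for it, and whether the three categories overlap is not stated in the example. The helper `head_counts` is hypothetical.

```python
from fractions import Fraction

def head_counts(class_size, percentages):
    """Convert whole-number percentages of a class into exact head counts."""
    return {
        label: Fraction(class_size * pct, 100)
        for label, pct in percentages.items()
    }

fifth_grade = {"graduated": 77, "failed mathematics": 12, "failed reading": 11}
new_class = {"graduated": 88, "failed mathematics": 7, "failed reading": 5}

counts = head_counts(300, fifth_grade)
for label, count in counts.items():
    print(label, count)

# Both sets of percentages sum to 100, which is consistent with (but does not
# prove) the three categories being disjoint and covering the whole class.
print(sum(fifth_grade.values()), sum(new_class.values()))
```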


Measure Theory: Further Reading

• M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004

• S. I. Resnick, “A Probability Path”, Birkhäuser, 1999.

• A. Gut, “Probability: A Graduate Course”, Springer, 2005.

• R. M. Gray, “Entropy and Information Theory”, Springer Verlag (available free online), 1990.


The Data Science Pipeline

• Metric identification

• Data collection

• Data exploration and summary statistics

• Feature generation

• Feature importance testing

• Modeling

• Validation
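The stages above can be sketched as a chain of small functions. Everything in this example is an assumption made for illustration: the data are synthetic, the metric is mean squared error, the model is a hand-written least-squares line, and feature importance testing is omitted to keep the sketch short.

```python
import random
import statistics

def collect(rng, n=200):
    """Data collection: synthetic (x, y) pairs standing in for real measurements."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 1)) for x in xs]

def summarize(data):
    """Data exploration: a few summary statistics."""
    xs, ys = zip(*data)
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "stdev_y": statistics.stdev(ys),
    }

def make_features(data, summary):
    """Feature generation: center x using the training mean."""
    return [(x - summary["mean_x"], y) for x, y in data]

def fit_line(rows):
    """Modeling: ordinary least squares for y = slope * f + intercept."""
    fs, ys = zip(*rows)
    f_bar, y_bar = statistics.fmean(fs), statistics.fmean(ys)
    slope = sum((f - f_bar) * (y - y_bar) for f, y in rows) / sum(
        (f - f_bar) ** 2 for f in fs
    )
    return slope, y_bar - slope * f_bar

def validate(model, rows):
    """Validation: the chosen metric, mean squared error on held-out rows."""
    slope, intercept = model
    return statistics.fmean([(y - (slope * f + intercept)) ** 2 for f, y in rows])

rng = random.Random(42)
data = collect(rng)
train, test = data[:150], data[150:]

summary = summarize(train)
model = fit_line(make_features(train, summary))
mse = validate(model, make_features(test, summary))

print("slope, intercept:", model)
print("held-out MSE:", mse)
```

Keeping each stage as a separate function with explicit inputs and outputs is what makes it possible to automate the chain with a build-style tool.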


Automating the Data Pipeline

Drake – Like make for data.


For next time

• Homework 1

• Due this Tuesday!!!